
2025-06-02-12-08

Lessons Learned: A Multi-Agent Framework for Code LLMs to Learn and Improve

Abstract

arXiv:2505.23946v1 Announce Type: new Abstract: Recent studies show that LLMs possess different skills and specialize in different tasks. In fact, we observe that their varied performance occurs at several levels of granularity. For example, in the code optimization task, code LLMs excel at different optimization categories and no one dominates others. This observation prompts the question of how one leverages multiple LLM agents to solve a coding problem without knowing their complementary strengths a priori. We argue that a team of agents can learn from each other's successes and failures so as to improve their own performance. Thus, a lesson is the knowledge produced by an agent and passed on to other agents in the collective solution process. We propose a lesson-based collaboration framework, design the lesson solicitation-banking-selection mechanism, and demonstrate that a team of small LLMs with lessons learned can outperform a much larger LLM and other multi-LLM collaboration methods.



EmbAdvisor: Adaptive Cache Management for Sustainable LLM Serving

Abstract

arXiv:2505.23970v1 Announce Type: new Abstract: As large language models (LLMs) become widely used, their environmental impact, especially carbon emissions, has attracted more attention. Prior studies focus on compute-related carbon emissions. In this paper, we find that storage is another key contributor. LLM caching, which saves and reuses KV caches for repeated context, reduces operational carbon by avoiding redundant computation. However, this benefit comes at the cost of embodied carbon from high-capacity, high-speed SSDs. As LLMs scale, the embodied carbon of storage grows significantly. To address this tradeoff, we present EmbAdvisor, a carbon-aware caching framework that selects the optimal cache size for LLM serving. EmbAdvisor profiles different LLM tasks and uses an Integer Linear Programming (ILP) solver to select cache sizes that meet SLOs while minimizing total carbon emissions. Overall, EmbAdvisor reduces the average carbon emissions of a Llama-3 70B model by 9.5% under various carbon intensities compared to a non-adaptive cache scenario, and can save up to 31.2% when the carbon intensity is low.
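
The tradeoff described above (operational carbon saved by caching versus embodied carbon of the SSDs) can be illustrated with a toy calculation. This is a minimal sketch and not EmbAdvisor's method: it swaps the ILP solver for an exhaustive search over three candidate cache sizes, and every name and number in it (`CacheOption`, `pick_cache`, the energy, embodied-carbon, and latency figures) is a hypothetical placeholder rather than a measurement from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheOption:
    size_tb: float      # SSD cache capacity (hypothetical)
    energy_kwh: float   # energy to serve the workload at this cache size
    embodied_kg: float  # amortized embodied carbon of the SSDs, in kgCO2e
    latency_ms: float   # profiled latency at this cache size


def total_carbon(opt: CacheOption, intensity_kg_per_kwh: float) -> float:
    """Operational carbon (energy x grid intensity) plus embodied carbon."""
    return opt.energy_kwh * intensity_kg_per_kwh + opt.embodied_kg


def pick_cache(options, intensity_kg_per_kwh, slo_ms):
    """Brute-force stand-in for the ILP: cheapest option that meets the SLO."""
    feasible = [o for o in options if o.latency_ms <= slo_ms]
    if not feasible:
        raise ValueError("no cache size meets the SLO")
    return min(feasible, key=lambda o: total_carbon(o, intensity_kg_per_kwh))


# Made-up profile: a larger cache saves energy but costs more embodied carbon.
OPTIONS = [
    CacheOption(size_tb=0.0, energy_kwh=100.0, embodied_kg=0.0, latency_ms=900.0),
    CacheOption(size_tb=1.0, energy_kwh=70.0, embodied_kg=5.0, latency_ms=600.0),
    CacheOption(size_tb=4.0, energy_kwh=55.0, embodied_kg=20.0, latency_ms=450.0),
]

for intensity in (0.1, 0.8, 2.0):
    best = pick_cache(OPTIONS, intensity, slo_ms=1000.0)
    print(intensity, best.size_tb)
```

With these made-up numbers the preferred cache shrinks as the grid gets cleaner, because the energy saved by caching is worth less carbon while the embodied cost of the SSDs stays fixed. A tighter SLO can still force a larger cache regardless of carbon intensity.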



SkyLB: A Locality-Aware Cross-Region Load Balancer for LLM Inference

Abstract

arXiv:2505.24095v1 Announce Type: new Abstract: Serving Large Language Models (LLMs) efficiently in multi-region setups remains a challenge. Due to cost and GPU availability concerns, providers typically deploy LLMs in multiple regions using instances with long-term commitments, like reserved instances or on-premise clusters, which are often underutilized due to their region-local traffic handling and diurnal traffic variance. In this paper, we introduce SkyLB, a locality-aware multi-region load balancer for LLM inference that aggregates regional diurnal patterns through cross-region traffic handling. By doing so, SkyLB enables providers to reserve instances based on expected global demand, rather than peak demand in each individual region. Meanwhile, SkyLB preserves KV-Cache locality and a balanced load, ensuring cost efficiency without sacrificing performance. SkyLB achieves this with a cache-aware cross-region traffic handler and a selective pushing load balancing mechanism based on checking pending requests. Our evaluation on real-world workloads shows that it achieves 1.12-2.06x higher throughput and 1.74-6.30x lower latency compared to existing load balancers, while reducing total serving cost by 25%.
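
The capacity argument above, reserving for global demand rather than for each region's own peak, comes down to simple arithmetic. The sketch below is only an illustration of that arithmetic, not SkyLB's load balancer; the function name `reserved_instances` and the demand figures are hypothetical.

```python
def reserved_instances(regional_demand):
    """Compare two reservation policies.

    regional_demand maps a region name to a list of demand values
    (in instances), one value per time slot, same length for all regions.
    Returns (sum of per-region peaks, peak of the pooled global demand).
    """
    per_region = sum(max(slots) for slots in regional_demand.values())
    pooled = max(sum(slot) for slot in zip(*regional_demand.values()))
    return per_region, pooled


# Made-up diurnal pattern: each region peaks in a different time slot.
DEMAND = {
    "us": [8, 2, 3],
    "eu": [3, 8, 2],
    "asia": [2, 3, 8],
}

per_region, pooled = reserved_instances(DEMAND)
print(per_region, pooled)
```

In this toy pattern the per-region policy must reserve for three separate peaks, while the pooled policy only needs to cover the busiest global time slot. The saving disappears if all regions peak at the same time.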



MSQA: Benchmarking LLMs on Graduate-Level Materials Science Reasoning and Knowledge

Abstract

arXiv:2505.23982v1 Announce Type: new Abstract: Despite recent advances in large language models (LLMs) for materials science, there is a lack of benchmarks for evaluating their domain-specific knowledge and complex reasoning abilities. To bridge this gap, we introduce MSQA, a comprehensive evaluation benchmark of 1,757 graduate-level materials science questions in two formats: detailed explanatory responses and binary True/False assessments. MSQA distinctively challenges LLMs by requiring both precise factual knowledge and multi-step reasoning across seven materials science sub-fields, such as structure-property relationships, synthesis processes, and computational modeling. Through experiments with 10 state-of-the-art LLMs, we identify significant gaps in current LLM performance. While API-based proprietary LLMs achieve up to 84.5% accuracy, open-source (OSS) LLMs peak around 60.5%, and domain-specific LLMs often underperform significantly due to overfitting and distributional shifts. MSQA is the first benchmark to jointly evaluate the factual knowledge and reasoning capabilities that LLMs need for advanced materials science.



mRAG: Elucidating the Design Space of Multi-modal Retrieval-Augmented Generation

Abstract

arXiv:2505.24073v1 Announce Type: new Abstract: Large Vision-Language Models (LVLMs) have made remarkable strides in multimodal tasks such as visual question answering, visual grounding, and complex reasoning. However, they remain limited by static training data, susceptibility to hallucinations, and inability to verify claims against up-to-date, external evidence, compromising their performance in dynamic real-world applications. Retrieval-Augmented Generation (RAG) offers a practical solution to mitigate these challenges by allowing the LVLMs to access large-scale knowledge databases via retrieval mechanisms, thereby grounding model outputs in factual, contextually relevant information. Here in this paper, we conduct the first systematic dissection of the multimodal RAG pipeline for LVLMs, explicitly investigating (1) the retrieval phase: on the modality configurations and retrieval strategies, (2) the re-ranking stage: on strategies to mitigate positional biases and improve the relevance of retrieved evidence, and (3) the generation phase: we further investigate how to best integrate retrieved candidates into the final generation process. Finally, we extend to explore a unified agentic framework that integrates re-ranking and generation through self-reflection, enabling LVLMs to select relevant evidence and suppress irrelevant context dynamically. Our full-stack exploration of RAG for LVLMs yields substantial insights, resulting in an average performance boost of 5% without any fine-tuning.



Leave it to the Specialist: Repair Sparse LLMs with Sparse Fine-Tuning via Sparsity Evolution

Abstract

arXiv:2505.24037v1 Announce Type: new Abstract: Large language models (LLMs) have achieved remarkable success across various tasks but face deployment challenges due to their massive computational demands. While post-training pruning methods like SparseGPT and Wanda can effectively reduce the model size, they struggle to maintain model performance at high sparsity levels, limiting their utility for downstream tasks. Existing fine-tuning methods, such as full fine-tuning and LoRA, fail to preserve sparsity because they require updating the whole dense matrices, and are therefore not well suited to sparse LLMs. In this paper, we propose Sparsity Evolution Fine-Tuning (SEFT), a novel method designed specifically for sparse LLMs. SEFT dynamically evolves the sparse topology of pruned models during fine-tuning, while preserving the overall sparsity throughout the process. The strengths of SEFT lie in its ability to perform task-specific adaptation through a weight drop-and-grow strategy, enabling the pruned model to self-adapt its sparse connectivity pattern based on the target dataset. Furthermore, a sensitivity-driven pruning criterion is employed to ensure that the desired sparsity level is consistently maintained throughout fine-tuning. Our experiments on various LLMs, including LLaMA families, DeepSeek, and Mistral, across a diverse set of benchmarks demonstrate that SEFT achieves stronger performance while offering superior memory and time efficiency compared to existing baselines. Our code is publicly available at: https://github.com/QiaoXiao7282/SEFT.
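
To make the drop-and-grow idea concrete, here is a minimal sketch of one mask update that keeps the number of active weights constant. It is not SEFT's implementation: the abstract does not specify the drop or grow criteria, so this sketch assumes a common dynamic-sparse-training heuristic (drop the smallest-magnitude active weights, grow the inactive positions with the largest gradient magnitude). The function name `drop_and_grow` and the example values are hypothetical.

```python
def drop_and_grow(weights, mask, grads, k):
    """Return a new mask with k active weights dropped and k inactive ones grown.

    Assumed criteria (not taken from the paper):
      drop = active positions with the smallest |weight|
      grow = inactive positions with the largest |gradient|
    The count of active positions, and hence the sparsity, is unchanged.
    """
    active = [i for i, m in enumerate(mask) if m]
    inactive = [i for i, m in enumerate(mask) if not m]
    k = min(k, len(active), len(inactive))

    drop = sorted(active, key=lambda i: abs(weights[i]))[:k]
    grow = sorted(inactive, key=lambda i: abs(grads[i]), reverse=True)[:k]

    new_mask = list(mask)
    for i in drop:
        new_mask[i] = False
    for i in grow:
        new_mask[i] = True
    return new_mask


weights = [0.9, 0.01, 0.0, 0.0, -0.5, 0.0]
mask = [True, True, False, False, True, False]
grads = [0.1, 0.1, 0.7, -0.05, 0.2, -0.9]

new_mask = drop_and_grow(weights, mask, grads, k=1)
print(new_mask, sum(new_mask) == sum(mask))
```

Because the same number of positions is dropped and grown, the topology changes while the overall sparsity level stays fixed, which is the property the abstract emphasizes.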



Using Reasoning Models to Generate Search Heuristics that Solve Open Instances of Combinatorial Design Problems

Abstract

arXiv:2505.23881v1 Announce Type: new Abstract: Large Language Models (LLMs) with reasoning are trained to iteratively generate and refine their answers before finalizing them, which can help with applications to mathematics and code generation. We apply code generation with reasoning LLMs to a specific task in the mathematical field of combinatorial design. This field studies diverse types of combinatorial designs, many of which have lists of open instances for which existence has not yet been determined. The Constructive Protocol CPro1 uses LLMs to generate search heuristics that have the potential to construct solutions to small open instances. Starting with a textual definition and a validity verifier for a particular type of design, CPro1 guides LLMs to select and implement strategies, while providing automated hyperparameter tuning and execution feedback. CPro1 with reasoning LLMs successfully solves long-standing open instances for 7 of 16 combinatorial design problems selected from the 2006 Handbook of Combinatorial Designs, including new solved instances for 3 of these (Bhaskar Rao Designs, Symmetric Weighing Matrices, Balanced Ternary Designs) that were unsolved by CPro1 with non-reasoning LLMs. It also solves open instances for several problems from recent (2025) literature, generating new Covering Sequences, Johnson Clique Covers, Deletion Codes, and a Uniform Nested Steiner Quadruple System.



GenIC: An LLM-Based Framework for Instance Completion in Knowledge Graphs

Abstract

arXiv:2505.24036v1 Announce Type: new Abstract: Knowledge graph completion aims to address the gaps of knowledge bases by adding new triples that represent facts. The complexity of this task depends on how many parts of a triple are already known. Instance completion involves predicting the relation-tail pair when only the head is given (h, ?, ?). Notably, modern knowledge bases often contain entity descriptions and types, which can provide valuable context for inferring missing facts. By leveraging these textual descriptions and the ability of large language models to extract facts from them and recognize patterns within the knowledge graph schema, we propose an LLM-powered, end-to-end instance completion approach. Specifically, we introduce GenIC: a two-step Generative Instance Completion framework. The first step focuses on property prediction, treated as a multi-label classification task. The second step is link prediction, framed as a generative sequence-to-sequence task. Experimental results on three datasets show that our method outperforms existing baselines. Our code is available at https://github.com/amal-gader/genic.



Multi-RAG: A Multimodal Retrieval-Augmented Generation System for Adaptive Video Understanding

Abstract

arXiv:2505.23990v1 Announce Type: new Abstract: To effectively engage in human society, the ability to adapt, filter information, and make informed decisions in ever-changing situations is critical. As robots and intelligent agents become more integrated into human life, there is a growing opportunity, and need, to offload the cognitive burden on humans to these systems, particularly in dynamic, information-rich scenarios. To fill this critical need, we present Multi-RAG, a multimodal retrieval-augmented generation system designed to provide adaptive assistance to humans in information-intensive circumstances. Our system aims to improve situational understanding and reduce cognitive load by integrating and reasoning over multi-source information streams, including video, audio, and text. As an enabling step toward long-term human-robot partnerships, Multi-RAG explores how multimodal information understanding can serve as a foundation for adaptive robotic assistance in dynamic, human-centered situations. To evaluate its capability in a realistic human-assistance proxy task, we benchmarked Multi-RAG on the MMBench-Video dataset, a challenging multimodal video understanding benchmark. Our system achieves superior performance compared to existing open-source video large language models (Video-LLMs) and large vision-language models (LVLMs), while utilizing fewer resources and less input data. The results demonstrate Multi-RAG's potential as a practical and efficient foundation for future human-robot adaptive assistance systems in dynamic, real-world contexts.



OWL: Optimized Workforce Learning for General Multi-Agent Assistance in Real-World Task Automation

Abstract

arXiv:2505.23885v1 Announce Type: new Abstract: Large Language Model (LLM)-based multi-agent systems show promise for automating real-world tasks but struggle to transfer across domains due to their domain-specific nature. Current approaches face two critical shortcomings: they require complete architectural redesign and full retraining of all components when applied to new domains. We introduce Workforce, a hierarchical multi-agent framework that decouples strategic planning from specialized execution through a modular architecture comprising: (i) a domain-agnostic Planner for task decomposition, (ii) a Coordinator for subtask management, and (iii) specialized Workers with domain-specific tool-calling capabilities. This decoupling enables cross-domain transferability during both inference and training phases: During inference, Workforce seamlessly adapts to new domains by adding or modifying worker agents; For training, we introduce Optimized Workforce Learning (OWL), which improves generalization across domains by optimizing a domain-agnostic planner with reinforcement learning from real-world feedback. To validate our approach, we evaluate Workforce on the GAIA benchmark, covering various realistic, multi-domain agentic tasks. Experimental results demonstrate Workforce achieves open-source state-of-the-art performance (69.70%), outperforming commercial systems like OpenAI's Deep Research by 2.34%. More notably, our OWL-trained 32B model achieves 52.73% accuracy (+16.37%) and demonstrates performance comparable to GPT-4o on challenging tasks. To summarize, by enabling scalable generalization and modular domain transfer, our work establishes a foundation for the next generation of general-purpose AI assistants.



An Adversary-Resistant Multi-Agent LLM System via Credibility Scoring

Abstract

arXiv:2505.24239v1 Announce Type: new Abstract: While multi-agent LLM systems show strong capabilities in various domains, they are highly vulnerable to adversarial and low-performing agents. To resolve this issue, in this paper, we introduce a general and adversary-resistant multi-agent LLM framework based on credibility scoring. We model the collaborative query-answering process as an iterative game, where the agents communicate and contribute to a final system output. Our system associates a credibility score with each agent and uses these scores when aggregating the team outputs. The credibility scores are learned gradually based on the past contributions of each agent in query answering. Our experiments across multiple tasks and settings demonstrate our system's effectiveness in mitigating adversarial influence and enhancing the resilience of multi-agent cooperation, even in adversary-majority settings.
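
The mechanism described above, aggregating answers by credibility and learning the scores from past contributions, can be sketched in a few lines. This is an illustrative toy and not the paper's algorithm: it assumes discrete answers, a ground-truth signal after each round, and an exponential-moving-average update. The names `aggregate` and `update` and the learning rate are hypothetical.

```python
from collections import defaultdict


def aggregate(answers, credibility):
    """Credibility-weighted vote: return the answer with the largest total score."""
    totals = defaultdict(float)
    for agent, answer in answers.items():
        totals[answer] += credibility[agent]
    return max(totals, key=totals.get)


def update(credibility, answers, correct, lr=0.5):
    """Move each agent's score toward 1 if it was right, toward 0 otherwise.

    Assumption: a correctness signal is available after each round.
    """
    return {
        agent: (1 - lr) * score + lr * (1.0 if answers[agent] == correct else 0.0)
        for agent, score in credibility.items()
    }


# Adversary-majority toy setting: two adversaries always answer "B".
answers = {"honest": "A", "adv1": "B", "adv2": "B"}
cred = {"honest": 1.0, "adv1": 1.0, "adv2": 1.0}

first = aggregate(answers, cred)   # uniform scores: the majority wins
for _ in range(2):
    cred = update(cred, answers, correct="A")
later = aggregate(answers, cred)   # adversaries have been down-weighted
print(first, later, cred)
```

With uniform initial scores the adversarial majority wins the first vote. After two rounds of feedback the adversaries' combined weight falls below the honest agent's, so the weighted vote flips even though the adversaries still outnumber it.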



InterMT: Multi-Turn Interleaved Preference Alignment with Human Feedback

Abstract

arXiv:2505.23950v1 Announce Type: new Abstract: As multimodal large models (MLLMs) continue to advance across challenging tasks, a key question emerges: What essential capabilities are still missing? A critical aspect of human learning is continuous interaction with the environment, which is not limited to language but also involves multimodal understanding and generation. To move closer to human-level intelligence, models must similarly support multi-turn, multimodal interaction. In particular, they should comprehend interleaved multimodal contexts and respond coherently in ongoing exchanges. In this work, we present an initial exploration through InterMT, the first preference dataset for multi-turn multimodal interaction, grounded in real human feedback. In this exploration, we particularly emphasize the importance of human oversight, introducing expert annotations to guide the process, motivated by the fact that current MLLMs lack such complex interactive capabilities. InterMT captures human preferences at both global and local levels across nine sub-dimensions, and consists of 15.6k prompts, 52.6k multi-turn dialogue instances, and 32.4k human-labeled preference pairs. To compensate for the lack of capability for multimodal understanding and generation, we introduce an agentic workflow that leverages tool-augmented MLLMs to construct multi-turn QA instances. To further this goal, we introduce InterMT-Bench to assess the ability of MLLMs in assisting judges with multi-turn, multimodal tasks. We demonstrate the utility of InterMT through applications such as judge moderation and further reveal the multi-turn scaling law of the judge model. We hope that open-sourcing our data can help facilitate further research on aligning current MLLMs to the next step. Our project website can be found at https://pku-intermt.github.io.



ProofNet++: A Neuro-Symbolic System for Formal Proof Verification with Self-Correction

Abstract

arXiv:2505.24230v1 Announce Type: new Abstract: We propose ProofNet++, a neuro-symbolic framework that enhances automated theorem proving by combining large language models (LLMs) with formal proof verification and self-correction mechanisms. Current LLM-based systems suffer from hallucinated logical steps and unverifiable reasoning. ProofNet++ mitigates these limitations by integrating symbolic proof tree supervision, a reinforcement learning loop using verifiers as reward functions, and an iterative self-correction module. Our experiments on miniF2F, Lean's mathlib, and HOL Light show that ProofNet++ significantly improves proof accuracy, correctness, and formal verifiability over prior models. We provide theoretical analysis of the convergence and stability of the verifier-guided RL framework and release our datasets and codebase for future research.



Bootstrapping LLM Robustness for VLM Safety via Reducing the Pretraining Modality Gap

Abstract

arXiv:2505.24208v1 Announce Type: new Abstract: Ensuring Vision-Language Models (VLMs) generate safe outputs is crucial for their reliable deployment. However, LVLMs suffer from drastic safety degradation compared to their LLM backbone. Even blank or irrelevant images can trigger LVLMs to generate harmful responses to prompts that would otherwise be refused in text-only contexts. The modality gap between image and text representations has been recently hypothesized to contribute to safety degradation of LVLMs. However, if and how the amount of modality gap affects LVLMs' safety is not studied. In this work, we show that the amount of modality gap is highly inversely correlated with VLMs' safety. Then, we show that this modality gap is introduced during pretraining LVLMs and persists through fine-tuning. Inspired by this observation, we propose a regularization to reduce the modality gap during pretraining. Our extensive experiments on LLaVA v1.5, ShareGPT4V, and MiniGPT-4 show that our method substantially improves safety alignment of LVLMs, reducing unsafe rate by up to 16.3% without compromising performance, and can further boost existing defenses by up to 18.2%.



E^2GraphRAG: Streamlining Graph-based RAG for High Efficiency and Effectiveness

Abstract

arXiv:2505.24226v1 Announce Type: new Abstract: Graph-based RAG methods like GraphRAG have shown promising global understanding of the knowledge base by constructing hierarchical entity graphs. However, they often suffer from inefficiency and rely on manually pre-defined query modes, limiting practical use. In this paper, we propose E^2GraphRAG, a streamlined graph-based RAG framework that improves both Efficiency and Effectiveness. During the indexing stage, E^2GraphRAG constructs a summary tree with large language models and an entity graph with SpaCy based on document chunks. We then construct bidirectional indexes between entities and chunks to capture their many-to-many relationships, enabling fast lookup during both local and global retrieval. For the retrieval stage, we design an adaptive retrieval strategy that leverages the graph structure to retrieve and select between local and global modes. Experiments show that E^2GraphRAG achieves up to 10 times faster indexing than GraphRAG and 100 times speedup over LightRAG in retrieval while maintaining competitive QA performance.
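
The bidirectional entity-chunk index mentioned above is essentially a pair of inverted maps. The sketch below shows one possible shape for it, under stated assumptions: entities are supplied by hand here (the paper extracts them with SpaCy), and the function names `build_indexes` and `chunks_mentioning_all` are hypothetical, not taken from the E^2GraphRAG code.

```python
from collections import defaultdict


def build_indexes(chunk_entities):
    """Build both directions of the many-to-many entity/chunk relation.

    chunk_entities maps a chunk id to the entity names found in that chunk.
    Returns (entity_to_chunks, chunk_to_entities), both mapping to sets.
    """
    entity_to_chunks = defaultdict(set)
    chunk_to_entities = {}
    for chunk_id, entities in chunk_entities.items():
        found = set(entities)
        chunk_to_entities[chunk_id] = found
        for entity in found:
            entity_to_chunks[entity].add(chunk_id)
    return dict(entity_to_chunks), chunk_to_entities


def chunks_mentioning_all(entity_to_chunks, query_entities):
    """Local-style lookup: chunks that mention every entity in the query."""
    sets = [entity_to_chunks.get(e, set()) for e in query_entities]
    return set.intersection(*sets) if sets else set()


# Hand-written stand-in for entity extraction over three chunks.
CHUNKS = {
    "c1": ["Alice", "Paris"],
    "c2": ["Alice", "Bob"],
    "c3": ["Bob", "Paris"],
}

entity_to_chunks, chunk_to_entities = build_indexes(CHUNKS)
print(chunks_mentioning_all(entity_to_chunks, ["Alice", "Paris"]))
```

Both lookups are dictionary accesses followed by set operations, which is why this kind of index makes retrieval-time lookup cheap compared with scanning chunks.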



SentinelAgent: Graph-based Anomaly Detection in Multi-Agent Systems

Abstract

arXiv:2505.24201v1 Announce Type: new Abstract: The rise of large language model (LLM)-based multi-agent systems (MAS) introduces new security and reliability challenges. While these systems show great promise in decomposing and coordinating complex tasks, they also face multi-faceted risks across prompt manipulation, unsafe tool usage, and emergent agent miscoordination. Existing guardrail mechanisms offer only partial protection, primarily at the input-output level, and fall short in addressing systemic or multi-point failures in MAS. In this work, we present a system-level anomaly detection framework tailored for MAS, integrating structural modeling with runtime behavioral oversight. Our approach consists of two components. First, we propose a graph-based framework that models agent interactions as dynamic execution graphs, enabling semantic anomaly detection at node, edge, and path levels. Second, we introduce a pluggable SentinelAgent, an LLM-powered oversight agent that observes, analyzes, and intervenes in MAS execution based on security policies and contextual reasoning. By bridging abstract detection logic with actionable enforcement, our method detects not only single-point faults and prompt injections but also multi-agent collusion and latent exploit paths. We validate our framework through two case studies, including an email assistant and Microsoft's Magentic-One system, demonstrating its ability to detect covert risks and provide explainable root-cause attribution. Our work lays the foundation for more trustworthy, monitorable, and secure agent-based AI ecosystems.
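
As a rough illustration of path-level checks on an execution graph, the sketch below searches for a route from an untrusted node to a sensitive tool. It is a rule-based stand-in and not SentinelAgent itself, which the abstract describes as using LLM-powered semantic analysis. The node names, the edge list, and the function `find_path` are hypothetical.

```python
from collections import deque


def find_path(edges, source, target):
    """Breadth-first search over directed edges.

    Returns one shortest path from source to target as a list of nodes,
    or None if target is unreachable.
    """
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)

    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# Hypothetical execution trace of an email assistant: each edge is one
# observed message or tool call between components.
EDGES = [
    ("user", "planner"),
    ("planner", "web_reader"),
    ("web_reader", "planner"),
    ("planner", "email_tool"),
]

# Path-level check: can content from the untrusted web reader reach the email tool?
risky = find_path(EDGES, "web_reader", "email_tool")
print(risky)
```

A returned path only flags a route worth inspecting; deciding whether the traffic along it is actually malicious would need the kind of semantic, policy-aware analysis the abstract attributes to the oversight agent.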



Learning API Functionality from Demonstrations for Tool-based Agents

Abstract

arXiv:2505.24197v1 Announce Type: new Abstract: Digital tool-based agents that invoke external Application Programming Interfaces (APIs) often rely on documentation to understand API functionality. However, such documentation is frequently missing, outdated, privatized, or inconsistent, hindering the development of reliable, general-purpose agents. In this work, we propose learning API functionality directly from demonstrations as a new paradigm applicable in scenarios without documentation. Using existing API benchmarks, we collect demonstrations from both expert API-based agents and from self-exploration. To understand what information demonstrations must convey for successful task completion, we extensively study how the number of demonstrations and the use of LLM-generated summaries and evaluations affect the task success rate of the API-based agent. Our experiments across 3 datasets and 5 models show that learning functionality from demonstrations remains a non-trivial challenge, even for state-of-the-art LLMs. We find that providing explicit function calls and natural language critiques significantly improves the agent's task success rate due to more accurate parameter filling. We analyze failure modes, identify sources of error, and highlight key open challenges for future work in documentation-free, self-improving, API-based agents.



SCOUT: Teaching Pre-trained Language Models to Enhance Reasoning via Flow Chain-of-Thought

Abstract

arXiv:2505.24181v1 Announce Type: new Abstract: Chain of Thought (CoT) prompting improves the reasoning performance of large language models (LLMs) by encouraging step-by-step thinking. However, CoT-based methods depend on intermediate reasoning steps, which limits scalability and generalization. Recent work explores recursive reasoning, where LLMs reuse internal layers across iterations to refine latent representations without explicit CoT supervision. While promising, these approaches often require costly pretraining and lack a principled framework for how reasoning should evolve across iterations. We address this gap by introducing Flow Chain of Thought (Flow CoT), a reasoning paradigm that models recursive inference as a progressive trajectory of latent cognitive states. Flow CoT frames each iteration as a distinct cognitive stage, deepening reasoning across iterations without relying on manual supervision. To realize this, we propose SCOUT (Stepwise Cognitive Optimization Using Teachers), a lightweight fine-tuning framework that enables Flow CoT-style reasoning without the need for pretraining. SCOUT uses progressive distillation to align each iteration with a teacher of appropriate capacity, and a cross-attention-based retrospective module that integrates outputs from previous iterations while preserving the model's original computation flow. Experiments across eight reasoning benchmarks show that SCOUT consistently improves both accuracy and explanation quality, achieving up to 1.8% gains under fine-tuning. Qualitative analyses further reveal that SCOUT enables progressively deeper reasoning across iterations, refining both belief formation and explanation granularity. These results not only validate the effectiveness of SCOUT, but also demonstrate the practical viability of Flow CoT as a scalable framework for enhancing reasoning in LLMs.

摘要

思维链(CoT)提示通过鼓励逐步思考来提升大语言模型(LLMs)的推理性能。然而,基于CoT的方法依赖于中间推理步骤,这限制了其可扩展性和泛化能力。近期研究探索了递归推理方法,使LLMs在迭代中复用内部层以优化潜在表征,而无需显式的CoT监督。尽管前景可观,这些方法通常需要昂贵的预训练,且缺乏关于推理应如何跨迭代演进的原则性框架。为此,我们提出流思维链(Flow CoT),这是一种将递归推理建模为潜在认知状态渐进轨迹的推理范式。Flow CoT将每次迭代视为深化推理的独立认知阶段,无需依赖人工监督。为实现这一目标,我们提出SCOUT(基于教师的分步认知优化)——一个轻量级微调框架,可在无需预训练的情况下实现Flow CoT式推理。SCOUT采用渐进式蒸馏使每次迭代与适当容量的教师模型对齐,并通过基于交叉注意力的回顾模块整合先前迭代的输出,同时保留模型原始计算流。在八个推理基准上的实验表明,SCOUT持续提升了准确性和解释质量,在微调条件下最高获得1.8%的性能提升。定性分析进一步揭示,SCOUT能实现跨迭代的渐进深度推理,优化信念形成和解释粒度。这些结果不仅验证了SCOUT的有效性,也证明了Flow CoT作为增强LLMs推理能力的可扩展框架具有实际可行性。


FABLE: A Novel Data-Flow Analysis Benchmark on Procedural Text for Large Language Model Evaluation

Abstract

arXiv:2505.24258v1 Announce Type: new Abstract: Understanding how data moves, transforms, and persists, known as data flow, is fundamental to reasoning in procedural tasks. Although large language models (LLMs) are fluent in natural and programming languages and are increasingly applied to decisions in procedural tasks, they have not been systematically evaluated for their ability to perform data-flow reasoning. We introduce FABLE, an extensible benchmark designed to assess LLMs' understanding of data flow using structured, procedural text. FABLE adapts eight classical data-flow analyses from software engineering: reaching definitions, very busy expressions, available expressions, live variable analysis, interval analysis, type-state analysis, taint analysis, and concurrency analysis. These analyses are instantiated across three real-world domains: cooking recipes, travel routes, and automated plans. The benchmark includes 2,400 question-answer pairs, with 100 examples for each domain-analysis combination. We evaluate three types of LLMs: a reasoning-focused model (DeepSeek-R1 8B), a general-purpose model (LLaMA 3.1 8B), and a code-specific model (Granite Code 8B). Each model is tested using majority voting over five sampled completions per prompt. Results show that the reasoning model achieves higher accuracy, but at the cost of inference that is over 20 times slower than that of the other models. In contrast, the general-purpose and code-specific models perform close to random chance. FABLE provides the first diagnostic benchmark to systematically evaluate data-flow reasoning and offers insights for developing models with stronger procedural understanding.

摘要

理解数据如何移动、转换和持久化(即数据流)是进行程序性任务推理的基础。尽管大语言模型(LLMs)在自然语言和编程语言方面表现出色,并越来越多地应用于程序性任务决策,但其数据流推理能力尚未得到系统评估。我们提出了FABLE——一个可扩展的基准测试,旨在利用结构化程序文本来评估LLMs对数据流的理解能力。FABLE适配了软件工程中的八种经典数据流分析:到达定义、非常繁忙表达式、可用表达式、活跃变量分析、区间分析、类型状态分析、污点分析以及并发分析。这些分析实例化在三个现实领域:烹饪食谱、旅行路线和自动化计划。该基准包含2,400个问答对,每个领域-分析组合有100个示例。我们评估了三类LLMs:专注推理的模型(DeepSeek-R1 8B)、通用模型(LLaMA 3.1 8B)和代码专用模型(Granite Code 8B)。每个模型通过每个提示五次采样补全的多数投票进行测试。结果表明,推理模型准确率更高,但推理速度比其他模型慢20倍以上;而通用模型和代码专用模型的表现接近随机猜测。FABLE提供了首个系统性评估数据流推理的诊断基准,并为开发具有更强程序理解能力的模型提供了见解。
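
The evaluation protocol in the abstract (majority voting over five sampled completions) and the benchmark size (8 analyses × 3 domains × 100 examples) can be illustrated with a minimal sketch. The function name `majority_vote` and the tie-breaking rule (earliest sampled answer wins) are assumptions made for illustration; the abstract does not say how ties are resolved.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer among sampled completions.

    Ties are broken in favour of the answer that appears first in the
    sample order (an assumption; the abstract does not specify this).
    """
    counts = Counter(answers)
    top = max(counts.values())
    for answer in answers:
        if counts[answer] == top:
            return answer


# Benchmark size implied by the abstract: analyses x domains x examples.
NUM_ANALYSES = 8
NUM_DOMAINS = 3
EXAMPLES_PER_COMBINATION = 100
TOTAL_PAIRS = NUM_ANALYSES * NUM_DOMAINS * EXAMPLES_PER_COMBINATION

if __name__ == "__main__":
    samples = ["yes", "no", "yes", "yes", "no"]  # five hypothetical completions
    print(majority_vote(samples))
    print(TOTAL_PAIRS)
```

With five samples and a binary answer, a strict majority always exists, so the tie-breaking rule only matters for questions with more than two possible answers.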


GridRoute: A Benchmark for LLM-Based Route Planning with Cardinal Movement in Grid Environments

Abstract

arXiv:2505.24306v1 Announce Type: new Abstract: Recent advancements in Large Language Models (LLMs) have demonstrated their potential in planning and reasoning tasks, offering a flexible alternative to classical pathfinding algorithms. However, most existing studies focus on LLMs' independent reasoning capabilities and overlook the potential synergy between LLMs and traditional algorithms. To fill this gap, we propose GridRoute, a comprehensive evaluation benchmark that assesses how LLMs can take advantage of traditional algorithms. We also propose a novel hybrid prompting technique called Algorithm of Thought (AoT), which introduces traditional algorithms' guidance into prompting. Our benchmark evaluates six LLMs ranging from 7B to 72B parameters, assessing their correctness, optimality, and efficiency in grid environments of varying sizes. Our results show that AoT significantly boosts performance across all model sizes, particularly in larger or more complex environments, suggesting a promising approach to addressing path planning challenges. Our code is open-sourced at https://github.com/LinChance/GridRoute.

摘要

大语言模型(LLMs)的最新进展展示了其在规划与推理任务中的潜力,为传统路径搜索算法提供了灵活的替代方案。然而,现有研究大多关注LLMs的独立推理能力,忽视了LLMs与传统算法间的协同潜力。为填补这一空白,我们提出综合性评估基准GridRoute,用以评估LLMs如何利用传统算法优势。同时,我们提出一种新型混合提示技术"思维算法"(AoT),将传统算法指导引入提示过程。该基准测试了参数量从70亿到720亿不等的六种LLM在不同地图尺寸下的表现,评估其在各尺寸网格环境中正确性、最优性和效率方面的性能。结果表明,AoT能显著提升所有规模模型的性能,尤其在更大或更复杂的环境中,为解决路径规划挑战提供了可行方案。代码已开源:https://github.com/LinChance/GridRoute。
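
As a point of reference for the "traditional algorithms" mentioned in the abstract, the sketch below computes a shortest route with cardinal moves on a grid using breadth-first search. It is not the GridRoute or AoT implementation: the grid encoding (0 = free, 1 = obstacle), the move letters, and the function name `bfs_route` are all assumptions. A route like this could serve as a ground-truth answer for checking optimality, or as algorithmic guidance embedded in a prompt.

```python
from collections import deque

# Cardinal moves as (row delta, column delta); letters are hypothetical labels.
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}


def bfs_route(grid, start, goal):
    """Return a shortest move string from start to goal, or None if unreachable."""
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}  # cell -> (previous cell, move taken) or None
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            break
        for name, (dr, dc) in MOVES.items():
            nxt = (cell[0] + dr, cell[1] + dc)
            inside = 0 <= nxt[0] < rows and 0 <= nxt[1] < cols
            if inside and grid[nxt[0]][nxt[1]] == 0 and nxt not in parent:
                parent[nxt] = (cell, name)
                queue.append(nxt)
    if goal not in parent:
        return None
    moves = []
    cell = goal
    while parent[cell] is not None:  # walk back from goal to start
        cell, name = parent[cell]
        moves.append(name)
    return "".join(reversed(moves))


if __name__ == "__main__":
    grid = [
        [0, 0, 0],
        [1, 1, 0],
        [0, 0, 0],
    ]
    print(bfs_route(grid, (0, 0), (2, 0)))
```

Because every move has unit cost, breadth-first search is sufficient for optimality here; weighted grids would call for Dijkstra's algorithm or A* instead.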


Mind the Quote: Enabling Quotation-Aware Dialogue in LLMs via Plug-and-Play Modules

Abstract

arXiv:2505.24292v1 Announce Type: new Abstract: Human-AI conversation frequently relies on quoting earlier text (for example, "check it with the formula I just highlighted"), yet today's large language models (LLMs) lack an explicit mechanism for locating and exploiting such spans. We formalise the challenge as span-conditioned generation, decomposing each turn into the dialogue history, a set of token-offset quotation spans, and an intent utterance. Building on this abstraction, we introduce a quotation-centric data pipeline that automatically synthesises task-specific dialogues, verifies answer correctness through multi-stage consistency checks, and yields both a heterogeneous training corpus and the first benchmark covering five representative scenarios. To meet the benchmark's zero-overhead and parameter-efficiency requirements, we propose QuAda, a lightweight training-based method that attaches two bottleneck projections to every attention head, dynamically amplifying or suppressing attention to quoted spans at inference time while leaving the prompt unchanged and updating < 2.8% of backbone weights. Experiments across models show that QuAda is suitable for all scenarios and generalises to unseen topics, offering an effective, plug-and-play solution for quotation-aware dialogue.

摘要

人机对话经常需要引用先前文本——“用我刚高亮的公式核对”——然而当前大型语言模型(LLMs)缺乏定位和利用此类文本段的显式机制。我们将该挑战形式化为跨度条件生成问题,将每个对话轮次分解为对话历史、一组词符偏移的引用跨度以及意图话语。基于此抽象框架,我们提出以引用为中心的数据处理流程:自动生成任务特定对话,通过多阶段一致性检查验证答案正确性,最终产出异构训练语料库和首个涵盖五种典型场景的基准测试集。为满足该基准的零开销与参数高效要求,我们提出QuAda——一种轻量级训练方法,该方法在每个注意力头附加双重瓶颈投影,在推理时动态增强或抑制对引用跨度的注意力,同时保持提示不变且仅更新<2.8%的主干权重。跨模型实验表明,QuAda适用于所有场景并能泛化至未见主题,为引用感知对话提供了即插即用的有效解决方案。


RMoA: Optimizing Mixture-of-Agents through Diversity Maximization and Residual Compensation

Abstract

arXiv:2505.24442v1 Announce Type: new Abstract: Although multi-agent systems based on large language models show strong capabilities on multiple tasks, they are still limited by high computational overhead, information loss, and robustness issues. Inspired by ResNet's residual learning, we propose Residual Mixture-of-Agents (RMoA), integrating residual connections to optimize efficiency and reliability. To maximize information utilization from model responses while minimizing computational costs, we design an embedding-based diversity selection mechanism that greedily selects responses via vector similarity. Furthermore, to mitigate iterative information degradation, we introduce a Residual Extraction Agent to preserve cross-layer incremental information by capturing inter-layer response differences, coupled with a Residual Aggregation Agent for hierarchical information integration. Additionally, we propose an adaptive termination mechanism that dynamically halts processing based on residual convergence, further improving inference efficiency. RMoA achieves state-of-the-art performance on benchmarks across alignment, mathematical reasoning, code generation, and multitasking understanding, while significantly reducing computational overhead. Code is available at https://github.com/mindhunter01/RMoA.

摘要

尽管基于大语言模型的多智能体系统在多项任务中展现出强大能力,但其仍受限于高计算开销、信息丢失和鲁棒性问题。受ResNet残差学习的启发,我们提出残差智能体混合架构(RMoA),通过集成残差连接来优化效率与可靠性。为在最大化模型响应信息利用率的同时最小化计算成本,我们创新性地设计了一种基于嵌入向量的多样性选择机制,通过向量相似度贪婪地筛选响应。此外,为缓解迭代过程中的信息退化问题,我们引入残差提取智能体来捕获层间响应差异以保留跨层增量信息,并结合残差聚合智能体实现层次化信息整合。我们还提出自适应终止机制,根据残差收敛情况动态停止处理,进一步提升推理效率。RMoA在指令对齐、数学推理、代码生成和多任务理解等基准测试中均达到最先进性能,同时显著降低了计算开销。代码已开源:https://github.com/mindhunter01/RMoA。
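
The abstract describes an embedding-based diversity selection that greedily picks responses via vector similarity, without giving the exact rule. A minimal sketch of one common greedy variant follows, under stated assumptions: the selection is seeded with the first response, and each step adds the candidate whose highest cosine similarity to the already selected set is lowest. The function names and toy embeddings are hypothetical, and the actual RMoA criterion may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_diverse(embeddings, k):
    """Greedily choose up to k indices whose embeddings are mutually dissimilar."""
    if k <= 0 or not embeddings:
        return []
    selected = [0]  # assumption: seed with the first response
    while len(selected) < min(k, len(embeddings)):
        best_idx, best_score = None, float("inf")
        for i, emb in enumerate(embeddings):
            if i in selected:
                continue
            # Redundancy of candidate i = similarity to its closest selected item.
            score = max(cosine(emb, embeddings[j]) for j in selected)
            if score < best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
    return selected


if __name__ == "__main__":
    toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
    print(select_diverse(toy, 3))
```

The near-duplicate of the seed (index 1) is skipped in favour of responses pointing in different directions, which is the behaviour a diversity filter is meant to provide before aggregation.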


Random Rule Forest (RRF): Interpretable Ensembles of LLM-Generated Questions for Predicting Startup Success

Abstract

arXiv:2505.24622v1 Announce Type: new Abstract: Predicting startup success requires models that are both accurate and interpretable. We present a lightweight ensemble framework that combines YES/NO questions generated by large language models (LLMs), forming a transparent decision-making system. Each question acts as a weak heuristic, and by filtering, ranking, and aggregating them through a threshold-based voting mechanism, we construct a strong ensemble predictor. On a test set where 10% of startups are classified as successful, our approach achieves a precision rate of 50%, representing a 5x improvement over random selection, while remaining fully transparent. When we incorporate expert-guided heuristics into the generation process, performance improves further to 54% precision. These results highlight the value of combining LLM reasoning with human insight and demonstrate that simple, interpretable ensembles can support high-stakes decisions in domains such as venture capital (VC).

摘要

预测初创企业成功需要兼具准确性与可解释性的模型。我们提出一种轻量级集成框架,通过整合大型语言模型(LLMs)生成的二元问题,构建透明决策系统。每个问题作为弱启发式规则,经过基于阈值的投票机制进行筛选、排序和聚合后,形成强集成预测器。在成功初创企业占比10%的测试集中,该方法达到50%的精确率,较随机选择提升5倍,同时保持完全透明性。当引入专家指导的启发式规则至生成过程时,精确率进一步提升至54%。这些结果凸显了LLM推理与人类洞察相结合的价值,证明简单可解释的集成模型能够支持风险投资等高风险领域的决策。
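
The threshold-based voting and the reported 5x figure can be made concrete with a short sketch. The helper names and the example answers are hypothetical; only the 10% base rate and the 50% precision come from the abstract. The lift is the ratio of achieved precision to the base rate, since picking startups at random yields a precision equal to the share of successful startups.

```python
def ensemble_predict(answers, threshold):
    """Predict success when at least `threshold` YES/NO questions answer YES.

    `answers` is a list of booleans, one per LLM-generated question.
    """
    return sum(answers) >= threshold


def precision(predictions, labels):
    """Fraction of predicted positives that are truly positive."""
    predicted_positive = sum(predictions)
    if predicted_positive == 0:
        return 0.0
    true_positive = sum(1 for p, y in zip(predictions, labels) if p and y)
    return true_positive / predicted_positive


def lift_over_random(achieved_precision, base_rate):
    """Improvement factor relative to random selection."""
    return achieved_precision / base_rate


if __name__ == "__main__":
    # Hypothetical answers to three questions for one startup.
    print(ensemble_predict([True, True, False], threshold=2))
    # Figures reported in the abstract: 50% precision at a 10% base rate.
    print(lift_over_random(0.50, 0.10))
```

Raising the threshold trades recall for precision: fewer startups are flagged, but each flag is backed by more agreeing heuristics.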


SEAR: A Multimodal Dataset for Analyzing AR-LLM-Driven Social Engineering Behaviors

Abstract

arXiv:2505.24458v1 Announce Type: new Abstract: The SEAR Dataset is a novel multimodal resource designed to study the emerging threat of social engineering (SE) attacks orchestrated through augmented reality (AR) and multimodal large language models (LLMs). This dataset captures 180 annotated conversations across 60 participants in simulated adversarial scenarios, including meetings, classes and networking events. It comprises synchronized AR-captured visual/audio cues (e.g., facial expressions, vocal tones), environmental context, and curated social media profiles, alongside subjective metrics such as trust ratings and susceptibility assessments. Key findings reveal SEAR's alarming efficacy in eliciting compliance (e.g., 93.3% phishing link clicks, 85% call acceptance) and hijacking trust (76.7% post-interaction trust surge). The dataset supports research in detecting AR-driven SE attacks, designing defensive frameworks, and understanding multimodal adversarial manipulation. Rigorous ethical safeguards, including anonymization and IRB compliance, ensure responsible use. The SEAR dataset is available at https://github.com/INSLabCN/SEAR-Dataset.

摘要

SEAR数据集是一种新型多模态资源,旨在研究通过增强现实(AR)和多模态大语言模型(LLM)实施的社会工程(SE)攻击这一新兴威胁。该数据集收录了60名参与者在模拟对抗场景(包括会议、课堂和社交活动)中的180段标注对话,包含同步采集的AR视觉/音频线索(如面部表情、语调)、环境上下文、精选社交媒体资料,以及信任评级和易感性评估等主观指标。关键发现表明SEAR在诱导服从(93.3%的钓鱼链接点击率、85%的电话接听率)和劫持信任(76.7%的交互后信任激增)方面具有惊人效力。本数据集支持检测AR驱动的社会工程攻击、设计防御框架及理解多模态对抗操纵等研究。通过数据匿名化和机构审查委员会合规等严格伦理保障措施确保其负责任使用。SEAR数据集发布于https://github.com/INSLabCN/SEAR-Dataset。


Optimizing the Interface Between Knowledge Graphs and LLMs for Complex Reasoning

Abstract

arXiv:2505.24478v1 Announce Type: new Abstract: Integrating Large Language Models (LLMs) with Knowledge Graphs (KGs) results in complex systems with numerous hyperparameters that directly affect performance. While such systems are increasingly common in retrieval-augmented generation, the role of systematic hyperparameter optimization remains underexplored. In this paper, we study this problem in the context of Cognee, a modular framework for end-to-end KG construction and retrieval. Using three multi-hop QA benchmarks (HotPotQA, TwoWikiMultiHop, and MuSiQue) we optimize parameters related to chunking, graph construction, retrieval, and prompting. Each configuration is scored using established metrics (exact match, F1, and DeepEval's LLM-based correctness metric). Our results demonstrate that meaningful gains can be achieved through targeted tuning. While the gains are consistent, they are not uniform, with performance varying across datasets and metrics. This variability highlights both the value of tuning and the limitations of standard evaluation measures. While demonstrating the immediate potential of hyperparameter tuning, we argue that future progress will depend not only on architectural advances but also on clearer frameworks for optimization and evaluation in complex, modular systems.

摘要

将大语言模型(LLMs)与知识图谱(KGs)集成会形成具有众多直接影响性能的超参数的复杂系统。尽管此类系统在检索增强生成中日益普遍,但系统性超参数优化的作用仍未得到充分探索。本文以Cognee(一个端到端知识图谱构建与检索的模块化框架)为背景研究该问题。通过使用三个多跳问答基准数据集(HotPotQA、TwoWikiMultiHop和MuSiQue),我们优化了与文本分块、图谱构建、检索及提示工程相关的参数。每种配置均采用现有指标(精确匹配、F1值及DeepEval基于LLM的正确性指标)进行评分。结果表明,通过针对性调优可获得显著性能提升。虽然增益具有一致性,但不同数据集和指标间存在性能差异,这种差异性既凸显了调优的价值,也揭示了标准评估方法的局限性。在证明超参数调优即时潜力的同时,我们认为未来进展不仅取决于架构创新,更需建立针对复杂模块化系统的优化与评估的清晰框架。
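
Two of the metrics named in the abstract, exact match and F1, are commonly computed at the token level for QA benchmarks. The sketch below is a simplified version under stated assumptions: normalization only lowercases and strips ASCII punctuation, whereas official benchmark scripts typically also remove articles. The function names are hypothetical, and this is not Cognee's or DeepEval's code.

```python
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop ASCII punctuation, and split on whitespace."""
    lowered = text.lower()
    cleaned = "".join(ch for ch in lowered if ch not in string.punctuation)
    return cleaned.split()


def exact_match(prediction, gold):
    """True when prediction and gold are identical after normalization."""
    return normalize(prediction) == normalize(gold)


def token_f1(prediction, gold):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens, gold_tokens = normalize(prediction), normalize(gold)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("paris", "Paris."))
    print(round(token_f1("Paris, France", "Paris"), 3))
```

The two metrics can disagree on the same answer: "Paris, France" against the gold answer "Paris" fails exact match but still earns partial F1 credit, which is one reason a tuned configuration may gain on one metric and not another.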


How Much Backtracking is Enough? Exploring the Interplay of SFT and RL in Enhancing LLM Reasoning

Abstract

arXiv:2505.24273v1 Announce Type: new Abstract: Recent breakthroughs in large language models (LLMs) have effectively improved their reasoning abilities, particularly on mathematical and logical problems that have verifiable answers, through techniques such as supervised finetuning (SFT) and reinforcement learning (RL). Prior research indicates that RL effectively internalizes search strategies, enabling long chain-of-thought (CoT) reasoning, with backtracking emerging naturally as a learned capability. However, the precise benefits of backtracking, specifically how significantly it contributes to reasoning improvements and the optimal extent of its use, remain poorly understood. In this work, we systematically investigate the dynamics between SFT and RL on eight reasoning tasks: Countdown, Sudoku, Arc 1D, Geometry, Color Cube Rotation, List Functions, Zebra Puzzles, and Self Reference. Our findings highlight that short CoT sequences used in SFT as a warm-up contribute moderately to RL training compared with cold-start RL; however, this contribution diminishes as tasks become increasingly difficult. Motivated by this observation, we construct synthetic datasets varying systematically in the number of backtracking steps and conduct controlled experiments to isolate the influence of either the correctness (content) or the structure (i.e., backtrack frequency). We find that (1) longer CoT with backtracks generally induces better and more stable RL training, and (2) more challenging problems with larger search spaces tend to need higher numbers of backtracks during the SFT stage. Additionally, we demonstrate through experiments on distilled data that RL training is largely unaffected by the correctness of long CoT sequences, suggesting that RL prioritizes structural patterns over content correctness. Collectively, our results offer practical insights into designing optimal training strategies to effectively scale reasoning in LLMs.

摘要

大型语言模型(LLM)近期取得的突破性进展,通过监督微调(SFT)和强化学习(RL)等技术,有效提升了其在可验证答案的数学与逻辑问题上的推理能力。已有研究表明,RL能有效内化搜索策略,实现长链思维(CoT)推理,而回溯能力会自然习得。然而回溯的具体优势——尤其是其对推理改进的实际贡献度及最佳使用程度——仍缺乏深入理解。本研究系统探究了SFT与RL在八项推理任务(倒计时、数独、Arc一维问题、几何、立方体颜色旋转、列表函数、斑马谜题和自我参照)中的动态关系。研究发现:与冷启动RL相比,SFT阶段采用的短CoT序列确实对RL训练有适度贡献,但随着任务难度增加,这种贡献逐渐减弱。基于此发现,我们构建了回溯步数系统变化的合成数据集,通过控制实验分离了内容正确性与结构特征(即回溯频率)的影响。实验表明:(1)含回溯的长CoT通常能带来更好更稳定的RL训练效果;(2)搜索空间较大的复杂问题往往需要在SFT阶段使用更高频次回溯。此外,通过蒸馏数据的实验证明,RL训练基本不受长CoT序列内容正确性的影响,表明RL更关注结构模式而非内容准确性。这些发现为设计最优训练策略以有效扩展LLM的推理能力提供了实践指导。


Mixture-of-Experts for Personalized and Semantic-Aware Next Location Prediction

Abstract

arXiv:2505.24597v1 Announce Type: new Abstract: Next location prediction plays a critical role in understanding human mobility patterns. However, existing approaches face two core limitations: (1) they fall short in capturing the complex, multi-functional semantics of real-world locations; and (2) they lack the capacity to model heterogeneous behavioral dynamics across diverse user groups. To tackle these challenges, we introduce NextLocMoE, a novel framework built upon large language models (LLMs) and structured around a dual-level Mixture-of-Experts (MoE) design. Our architecture comprises two specialized modules: a Location Semantics MoE that operates at the embedding level to encode rich functional semantics of locations, and a Personalized MoE embedded within the Transformer backbone to dynamically adapt to individual user mobility patterns. In addition, we incorporate a history-aware routing mechanism that leverages long-term trajectory data to enhance expert selection and ensure prediction stability. Empirical evaluations across several real-world urban datasets show that NextLocMoE achieves superior performance in terms of predictive accuracy, cross-domain generalization, and interpretability.

摘要

下一位置预测在理解人类移动模式中具有关键作用。然而现有方法存在两个核心局限:(1) 难以捕捉现实场景中地点复杂的多功能语义;(2) 缺乏对不同用户群体异构行为动态的建模能力。为解决这些问题,我们提出NextLocMoE框架,该框架基于大语言模型(LLM)构建,采用双层级混合专家(MoE)设计。架构包含两个专用模块:在嵌入层运作的"位置语义MoE"用于编码地点的丰富功能语义,以及嵌入Transformer主干网的"个性化MoE"动态适配个体移动模式。此外,我们引入历史感知路由机制,利用长期轨迹数据优化专家选择并确保预测稳定性。在多个真实城市数据集上的实证评估表明,NextLocMoE在预测精度、跨域泛化性和可解释性方面均表现出优越性能。


Leveraging Knowledge Graphs and LLMs for Structured Generation of Misinformation

Abstract

arXiv:2505.24479v1 Announce Type: new Abstract: The rapid spread of misinformation, further amplified by recent advances in generative AI, poses significant threats to society, impacting public opinion, democratic stability, and national security. Understanding and proactively assessing these threats requires exploring methodologies that enable structured and scalable misinformation generation. In this paper, we propose a novel approach that leverages knowledge graphs (KGs) as structured semantic resources to systematically generate fake triplets. By analyzing the structural properties of KGs, such as the distance between entities and their predicates, we identify plausibly false relationships. These triplets are then used to guide large language models (LLMs) in generating misinformation statements with varying degrees of credibility. By utilizing structured semantic relationships, our deterministic approach produces misinformation inherently challenging for humans to detect, drawing exclusively upon publicly available KGs (e.g., WikiGraphs). Additionally, we investigate the effectiveness of LLMs in distinguishing between genuine and artificially generated misinformation. Our analysis highlights significant limitations in current LLM-based detection methods, underscoring the necessity for enhanced detection strategies and a deeper exploration of inherent biases in generative models.

摘要

错误信息的迅速传播在生成式人工智能最新进展的推波助澜下,对社会构成重大威胁,影响公众舆论、民主稳定和国家安全。要理解并主动评估这些威胁,需要探索能够实现结构化、可扩展错误信息生成的方法论。本文提出一种创新方法,利用知识图谱(KGs)作为结构化语义资源来系统生成虚假三元组。通过分析知识图谱的结构特性(如实体间距离及其谓词关系),我们识别出具有潜在虚假性的关联关系。这些三元组随后用于指导大语言模型(LLMs)生成具有不同可信度的错误信息陈述。我们的确定性方法通过利用结构化语义关系,仅基于公开知识图谱(如WikiGraphs)即可生成人类难以识别的固有性错误信息。此外,我们探究了大语言模型在区分真实信息与人工生成错误信息方面的有效性。分析结果表明,当前基于LLM的检测方法存在显著局限性,这凸显了加强检测策略以及深入探索生成模型固有偏见的必要性。


MELT: Towards Automated Multimodal Emotion Data Annotation by Leveraging LLM Embedded Knowledge

Abstract

arXiv:2505.24493v1 Announce Type: new Abstract: Although speech emotion recognition (SER) has advanced significantly with deep learning, annotation remains a major hurdle. Human annotation is not only costly but also subject to inconsistencies: annotators often have different preferences and may lack the necessary contextual knowledge, which can lead to varied and inaccurate labels. Meanwhile, Large Language Models (LLMs) have emerged as a scalable alternative for annotating text data. However, the potential of LLMs to perform emotional speech data annotation without human supervision has yet to be thoroughly investigated. To address these problems, we apply GPT-4o to annotate a multimodal dataset collected from the sitcom Friends, using only textual cues as inputs. By crafting structured text prompts, our methodology capitalizes on the knowledge GPT-4o has accumulated during its training, showcasing that it can generate accurate and contextually relevant annotations without direct access to multimodal inputs. Therefore, we propose MELT, a multimodal emotion dataset fully annotated by GPT-4o. We demonstrate the effectiveness of MELT by fine-tuning four self-supervised learning (SSL) backbones and assessing speech emotion recognition performance across emotion datasets. Additionally, the results of our subjective experiments demonstrate a consistent performance improvement on SER.

摘要

尽管语音情感识别(SER)在深度学习推动下取得显著进展,但标注工作仍是主要障碍。人工标注不仅成本高昂,且存在不一致性问题:标注者往往具有不同偏好并可能缺乏必要的情境知识,这会导致标签存在差异且不准确。与此同时,大型语言模型(LLMs)已成为文本数据标注的可扩展替代方案。然而,LLMs在无需人工监督情况下完成语音情感数据标注的潜力尚未得到充分研究。为解决这些问题,我们应用GPT-4o对情景剧《老友记》收集的多模态数据集进行标注,仅使用文本线索作为输入。通过设计结构化文本提示,我们的方法充分利用了GPT-4o在训练过程中积累的知识,证明其无需接触多模态输入即可生成准确且符合情境的标注。据此我们提出MELT,一个完全由GPT-4o标注的多模态情感数据集。通过微调四个自监督学习(SSL)骨干网络并跨情感数据集评估语音情感识别性能,我们验证了MELT的有效性。此外,主观实验结果表明该方法能持续提升SER性能。


Adaptable Cardiovascular Disease Risk Prediction from Heterogeneous Data using Large Language Models

Abstract

arXiv:2505.24655v1 Announce Type: new Abstract: Cardiovascular disease (CVD) risk prediction models are essential for identifying high-risk individuals and guiding preventive actions. However, existing models struggle with the challenges of real-world clinical practice as they oversimplify patient profiles, rely on rigid input schemas, and are sensitive to distribution shifts. We developed AdaCVD, an adaptable CVD risk prediction framework built on large language models extensively fine-tuned on over half a million participants from the UK Biobank. In benchmark comparisons, AdaCVD surpasses established risk scores and standard machine learning approaches, achieving state-of-the-art performance. Crucially, for the first time, it addresses key clinical challenges across three dimensions: it flexibly incorporates comprehensive yet variable patient information; it seamlessly integrates both structured data and unstructured text; and it rapidly adapts to new patient populations using minimal additional data. In stratified analyses, it demonstrates robust performance across demographic, socioeconomic, and clinical subgroups, including underrepresented cohorts. AdaCVD offers a promising path toward more flexible, AI-driven clinical decision support tools suited to the realities of heterogeneous and dynamic healthcare environments.

摘要

心血管疾病(CVD)风险预测模型对于识别高风险个体和指导预防措施至关重要。然而,现有模型因过度简化患者特征、依赖固定输入模式且对分布变化敏感,难以应对真实临床实践中的挑战。我们开发了AdaCVD——一个基于大型语言模型的可适应CVD风险预测框架,该模型通过对英国生物银行50余万参与者数据进行深度微调构建。基准测试表明,AdaCVD超越了现有风险评分标准和传统机器学习方法,实现了最先进的性能。其关键突破在于首次从三个维度解决核心临床难题:灵活整合全面但多样化的患者信息;无缝融合结构化数据与非结构化文本;利用极少额外数据即可快速适应新患者群体。分层分析显示,该模型在人口统计学、社会经济和临床亚组(包括代表性不足人群)中均表现出稳健性能。AdaCVD为开发适应异构动态医疗环境的灵活AI临床决策支持工具提供了可行路径。


Towards Scalable Schema Mapping using Large Language Models

Abstract

arXiv:2505.24716v1 Announce Type: new Abstract: The growing need to integrate information from a large number of diverse sources poses significant scalability challenges for data integration systems. These systems often rely on manually written schema mappings, which are complex, source-specific, and costly to maintain as sources evolve. While recent advances suggest that large language models (LLMs) can assist in automating schema matching by leveraging both structural and natural language cues, key challenges remain. In this paper, we identify three core issues with using LLMs for schema mapping: (1) inconsistent outputs due to sensitivity to input phrasing and structure, which we propose methods to address through sampling and aggregation techniques; (2) the need for more expressive mappings (e.g., GLaV), which strain the limited context windows of LLMs; and (3) the computational cost of repeated LLM calls, which we propose to mitigate through strategies like data type prefiltering.

摘要

随着整合大量多样化来源信息的需求日益增长,数据集成系统面临着严峻的可扩展性挑战。这些系统通常依赖人工编写的模式映射,这些映射不仅复杂且与特定数据源绑定,还会因数据源演变而产生高昂的维护成本。尽管最新研究表明大型语言模型(LLMs)能通过结合结构特征与自然语言线索来辅助自动化模式匹配,但仍存在关键性挑战。本文揭示了LLMs应用于模式映射时的三个核心问题:(1) 由于对输入措辞和结构的敏感性导致输出不一致,我们提出通过采样与聚合技术来解决;(2) 需要表达力更强的映射(如GLaV),而这类映射会给LLMs有限的上下文窗口带来压力;(3) 重复调用LLM产生的计算开销,我们提出通过数据类型预过滤等策略进行优化。
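
The third point in the abstract, reducing repeated LLM calls through data type prefiltering, can be sketched as follows. This is an illustrative assumption about how such a filter might work, not the paper's method: candidate attribute pairs are kept only when their declared types match exactly, so incompatible pairs never reach the LLM. The schema dictionaries, type names, and the function name `prefilter_pairs` are all hypothetical.

```python
from itertools import product


def prefilter_pairs(source_schema, target_schema):
    """Keep only (source, target) attribute pairs with the same declared type.

    Each schema maps an attribute name to a type name. Exact type equality is
    a simplifying assumption; a real system might allow compatible casts.
    """
    return [
        (src, tgt)
        for (src, src_type), (tgt, tgt_type) in product(
            source_schema.items(), target_schema.items()
        )
        if src_type == tgt_type
    ]


if __name__ == "__main__":
    source = {"id": "int", "name": "str", "dob": "date"}
    target = {
        "customer_id": "int",
        "full_name": "str",
        "birth_date": "date",
        "notes": "str",
    }
    candidates = prefilter_pairs(source, target)
    total = len(source) * len(target)
    print(f"{len(candidates)} of {total} pairs remain after prefiltering")
```

In this toy case, only a third of the attribute pairs would need an LLM judgment. Each surviving pair could then be queried several times and the answers aggregated, in line with the sampling and aggregation idea from the first point.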


Beyond the Black Box: Interpretability of LLMs in Finance

Abstract

arXiv:2505.24650v1 Announce Type: new Abstract: Large Language Models (LLMs) exhibit remarkable capabilities across a spectrum of tasks in financial services, including report generation, chatbots, sentiment analysis, regulatory compliance, investment advisory, financial knowledge retrieval, and summarization. However, their intrinsic complexity and lack of transparency pose significant challenges, especially in the highly regulated financial sector, where interpretability, fairness, and accountability are critical. As far as we are aware, this paper presents the first application in the finance domain of understanding and utilizing the inner workings of LLMs through mechanistic interpretability, addressing the pressing need for transparency and control in AI systems. Mechanistic interpretability is the most intuitive and transparent way to understand LLM behavior by reverse-engineering their internal workings. By dissecting the activations and circuits within these models, it provides insights into how specific features or components influence predictions, making it possible not only to observe but also to modify model behavior. In this paper, we explore the theoretical aspects of mechanistic interpretability and demonstrate its practical relevance through a range of financial use cases and experiments, including applications in trading strategies, sentiment analysis, bias, and hallucination detection. While not yet widely adopted, mechanistic interpretability is expected to become increasingly vital as adoption of LLMs increases. Advanced interpretability tools can ensure AI systems remain ethical, transparent, and aligned with evolving financial regulations. In this paper, we have put special emphasis on how these techniques can help unlock interpretability requirements for regulatory and compliance purposes, addressing both current needs and anticipating future expectations from financial regulators globally.

摘要

大型语言模型(LLMs)在金融服务领域展现出卓越的多任务处理能力,涵盖报告生成、聊天机器人、情感分析、合规监管、投资咨询、金融知识检索及文本摘要等场景。然而其固有复杂性与透明度缺失带来重大挑战,尤其在高度监管的金融行业,可解释性、公平性与问责机制至关重要。据我们所知,本文首次在金融领域应用机制可解释性方法来理解与利用LLMs内部工作机制,以应对AI系统透明性与可控性的迫切需求。机制可解释性通过逆向工程解析模型内部运作,是最直观透明的LLM行为理解方式:通过剖析模型内部的激活路径与电路机制,揭示特定特征或组件如何影响预测结果,从而实现不仅可观察更能修正模型行为的目标。本文系统探讨机制可解释性理论框架,并通过交易策略、情感分析、偏见检测及幻觉识别等金融应用案例实证其现实价值。尽管尚未大规模采用,随着LLMs应用普及,机制可解释性预计将愈发关键。先进的可解释性工具能确保AI系统符合伦理要求、保持透明度,并与动态演进的金融监管体系相协调。本文特别强调这些技术如何满足监管合规的可解释性要求,既应对当前需求,也前瞻性地响应全球金融监管机构的未来预期。


EXP-Bench: Can AI Conduct AI Research Experiments?

Abstract

arXiv:2505.24785v1 Announce Type: new Abstract: Automating AI research holds immense potential for accelerating scientific progress, yet current AI agents struggle with the complexities of rigorous, end-to-end experimentation. We introduce EXP-Bench, a novel benchmark designed to systematically evaluate AI agents on complete research experiments sourced from influential AI publications. Given a research question and incomplete starter code, EXP-Bench challenges AI agents to formulate hypotheses, design and implement experimental procedures, execute them, and analyze results. To enable the creation of such intricate and authentic tasks with high fidelity, we design a semi-autonomous pipeline to extract and structure crucial experimental details from these research papers and their associated open-source code. With the pipeline, EXP-Bench curated 461 AI research tasks from 51 top-tier AI research papers. Evaluations of leading LLM-based agents, such as OpenHands and IterativeAgent, on EXP-Bench demonstrate partial capabilities: while scores on individual experimental aspects such as design or implementation correctness occasionally reach 20-35%, the success rate for complete, executable experiments was a mere 0.5%. By identifying these bottlenecks and providing realistic step-by-step experiment procedures, EXP-Bench serves as a vital tool for future AI agents to improve their ability to conduct AI research experiments. EXP-Bench is open-sourced at https://github.com/Just-Curieous/Curie/tree/main/benchmark/exp_bench.

摘要

自动化人工智能研究对加速科学进步具有巨大潜力,但当前AI代理难以应对严谨端到端实验的复杂性。我们提出EXP-Bench这一新型基准,旨在基于有影响力的AI出版物中的完整研究实验,系统评估AI代理的能力。给定研究问题和不完整的初始代码,EXP-Bench要求AI代理完成假设构建、实验设计与实现、执行及结果分析全流程。为构建高保真度的复杂真实任务,我们设计了半自动化流程,从研究论文及其开源代码中提取并结构化关键实验细节。通过该流程,EXP-Bench从51篇顶级AI研究论文中精选出461个研究任务。对OpenHands、IterativeAgent等领先大模型代理的评估显示其部分能力:虽然设计或实现正确性等单项实验环节得分偶尔达到20-35%,但完整可执行实验的成功率仅为0.5%。通过揭示这些瓶颈并提供真实分步实验流程,EXP-Bench将成为提升AI代理科研实验能力的重要工具。项目已开源:https://github.com/Just-Curieous/Curie/tree/main/benchmark/exp_bench。


MiCRo: Mixture Modeling and Context-aware Routing for Personalized Preference Learning

Abstract

arXiv:2505.24846v1 Announce Type: new Abstract: Reward modeling is a key step in building safe foundation models when applying reinforcement learning from human feedback (RLHF) to align Large Language Models (LLMs). However, reward modeling based on the Bradley-Terry (BT) model assumes a global reward function, failing to capture the inherently diverse and heterogeneous human preferences. Hence, such oversimplification limits LLMs from supporting personalization and pluralistic alignment. Theoretically, we show that when human preferences follow a mixture distribution of diverse subgroups, a single BT model has an irreducible error. While existing solutions, such as multi-objective learning with fine-grained annotations, help address this issue, they are costly and constrained by predefined attributes, failing to fully capture the richness of human values. In this work, we introduce MiCRo, a two-stage framework that enhances personalized preference learning by leveraging large-scale binary preference datasets without requiring explicit fine-grained annotations. In the first stage, MiCRo introduces a context-aware mixture modeling approach to capture diverse human preferences. In the second stage, MiCRo integrates an online routing strategy that dynamically adapts mixture weights based on specific context to resolve ambiguity, allowing for efficient and scalable preference adaptation with minimal additional supervision. Experiments on multiple preference datasets demonstrate that MiCRo effectively captures diverse human preferences and significantly improves downstream personalization.

摘要

奖励建模是应用人类反馈强化学习(RLHF)对齐大型语言模型(LLMs)时构建安全基础模型的关键步骤。然而,基于Bradley-Terry(BT)模型的奖励建模假设存在全局奖励函数,无法捕捉人类偏好固有的多样性和异质性。这种过度简化限制了LLMs支持个性化和多元化对齐的能力。理论上,我们证明当人类偏好遵循多样子群的混合分布时,单一BT模型存在不可约误差。现有解决方案(如基于细粒度标注的多目标学习)虽能缓解该问题,但其成本高昂且受限于预定义属性,无法完整体现人类价值观的丰富性。本研究提出MiCRo框架,通过利用无需显式细粒度标注的大规模二元偏好数据集,分两阶段增强个性化偏好学习:第一阶段采用上下文感知的混合建模方法捕捉多样化人类偏好;第二阶段集成在线路由策略,根据特定上下文动态调整混合权重以消除歧义,实现高效可扩展的偏好适配。多组偏好数据集实验表明,MiCRo能有效捕获多样化人类偏好,并显著提升下游个性化性能。
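
To make the contrast between a single Bradley-Terry (BT) model and a mixture concrete, the sketch below uses the standard BT form, in which the probability of preferring response A over B is the logistic function of the reward difference. The mixture combines per-subgroup BT probabilities with weights that sum to one. The subgroup rewards and weights are invented for illustration, and this is not the MiCRo implementation.

```python
import math


def bt_probability(reward_a, reward_b):
    """Bradley-Terry probability that A is preferred over B."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


def mixture_probability(weights, rewards_a, rewards_b):
    """Preference probability under a mixture of per-subgroup BT models.

    weights[k] is the mixture weight of subgroup k (assumed to sum to 1);
    rewards_a[k] and rewards_b[k] are that subgroup's rewards for A and B.
    """
    return sum(
        w * bt_probability(ra, rb)
        for w, ra, rb in zip(weights, rewards_a, rewards_b)
    )


if __name__ == "__main__":
    # Two hypothetical subgroups with opposite tastes: one strongly prefers A,
    # the other strongly prefers B.
    rewards_a = [2.0, 0.0]
    rewards_b = [0.0, 2.0]
    print(mixture_probability([0.5, 0.5], rewards_a, rewards_b))
    # Context-dependent routing would shift the weights toward one subgroup.
    print(mixture_probability([0.9, 0.1], rewards_a, rewards_b))
```

With equal weights the population-level preference is a coin flip, even though each subgroup is far from indifferent. A single global reward fitted to such data can only report the averaged preference, which illustrates why adapting the weights to context can recover the subgroup-level signal.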


Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents

Abstract

arXiv:2505.24878v1 Announce Type: new Abstract: CAPTCHAs have been a critical bottleneck for deploying web agents in real-world applications, often blocking them from completing end-to-end automation tasks. While modern multimodal LLM agents have demonstrated impressive performance in static perception tasks, their ability to handle interactive, multi-step reasoning challenges like CAPTCHAs is largely untested. To address this gap, we introduce Open CaptchaWorld, the first web-based benchmark and platform specifically designed to evaluate the visual reasoning and interaction capabilities of MLLM-powered agents through diverse and dynamic CAPTCHA puzzles. Our benchmark spans 20 modern CAPTCHA types, totaling 225 CAPTCHAs, annotated with a new metric we propose: CAPTCHA Reasoning Depth, which quantifies the number of cognitive and motor steps required to solve each puzzle. Experimental results show that while humans consistently achieve near-perfect scores, state-of-the-art MLLM agents struggle significantly, with success rates of at most 40.0% (achieved by Browser-Use Openai-o3), far below the human-level performance of 93.3%. This highlights Open CaptchaWorld as a vital benchmark for diagnosing the limits of current multimodal agents and guiding the development of more robust multimodal reasoning systems. Code and Data are available at this https URL.

摘要

验证码(CAPTCHA)一直是网络代理在实际应用中部署的关键瓶颈,常常阻碍其完成端到端的自动化任务。尽管现代多模态大语言模型代理在静态感知任务中展现出卓越性能,但其处理交互式多步推理挑战(如验证码)的能力尚未得到充分验证。为填补这一空白,我们推出了Open CaptchaWorld——首个基于网络的基准测试平台,专门用于通过多样化动态验证码评估多模态大语言模型代理的视觉推理与交互能力。该基准涵盖20类现代验证码共计225个样本,并采用我们提出的新指标'验证码推理深度'进行标注,该指标量化了解决每个谜题所需的认知与操作步骤数。实验结果表明:人类参与者始终保持接近完美的准确率(93.3%),而最先进的多模态代理(Browser-Use Openai-o3)成功率最高仅达40.0%,远低于人类水平。这凸显了Open CaptchaWorld作为诊断当前多模态代理局限性的重要基准价值,可为开发更健壮的多模态推理系统提供指导。代码与数据详见此https URL。


Towards Natural Language Communication for Cooperative Autonomous Driving via Self-Play

Abstract

arXiv:2505.18334v1 Announce Type: cross Abstract: Past work has demonstrated that autonomous vehicles can drive more safely if they communicate with one another than if they do not. However, their communication has often not been human-understandable. Using natural language as a vehicle-to-vehicle (V2V) communication protocol offers the potential for autonomous vehicles to drive cooperatively not only with each other but also with human drivers. In this work, we propose a suite of traffic tasks in autonomous driving where vehicles in a traffic scenario need to communicate in natural language to facilitate coordination in order to avoid an imminent collision and/or support efficient traffic flow. To this end, this paper introduces a novel method, LLM+Debrief, to learn a message generation and high-level decision-making policy for autonomous vehicles through multi-agent discussion. To evaluate LLM agents for driving, we developed a gym-like simulation environment that contains a range of driving scenarios. Our experimental results demonstrate that LLM+Debrief is more effective at generating meaningful and human-understandable natural language messages to facilitate cooperation and coordination than a zero-shot LLM agent. Our code and demo videos are available at https://talking-vehicles.github.io/.

摘要

过往研究表明,具备车际通信能力的自动驾驶车辆比无通信功能的车辆行驶更安全。然而,现有车际通信系统往往缺乏人类可理解性。采用自然语言作为车对车(V2V)通信协议,不仅能使自动驾驶车辆实现相互协作,还可促进其与人类驾驶员的协同驾驶。本研究构建了一套自动驾驶交通任务集,要求交通场景中的车辆通过自然语言通信来实现协调避碰和/或优化交通流。为此,本文提出创新方法LLM+Debrief,通过多智能体讨论学习自动驾驶车辆的消息生成与高层决策策略。为评估语言大模型在驾驶中的表现,我们开发了包含多种驾驶场景的类gym仿真环境。实验结果表明,与零样本语言大模型智能体相比,LLM+Debrief方法能更有效地生成具有语义且人类可理解的自然语言消息以促进协作协调。代码及演示视频详见https://talking-vehicles.github.io/。


Few-Shot Optimization for Sensor Data Using Large Language Models: A Case Study on Fatigue Detection

Abstract

arXiv:2505.18754v1 Announce Type: cross Abstract: In this paper, we propose a novel few-shot optimization with HED-LM (Hybrid Euclidean Distance with Large Language Models) to improve example selection for sensor-based classification tasks. While few-shot prompting enables efficient inference with limited labeled data, its performance largely depends on the quality of selected examples. HED-LM addresses this challenge through a hybrid selection pipeline that filters candidate examples based on Euclidean distance and re-ranks them using contextual relevance scored by large language models (LLMs). To validate its effectiveness, we apply HED-LM to a fatigue detection task using accelerometer data characterized by overlapping patterns and high inter-subject variability. Unlike simpler tasks such as activity recognition, fatigue detection demands more nuanced example selection due to subtle differences in physiological signals. Our experiments show that HED-LM achieves a mean macro F1-score of 69.13 ± 10.71%, outperforming both random selection (59.30 ± 10.13%) and distance-only filtering (67.61 ± 11.39%). These represent relative improvements of 16.6% and 2.3%, respectively. The results confirm that combining numerical similarity with contextual relevance improves the robustness of few-shot prompting. Overall, HED-LM offers a practical solution to improve performance in real-world sensor-based learning tasks and shows potential for broader applications in healthcare monitoring, human activity recognition, and industrial safety scenarios.

摘要

本文提出了一种基于HED-LM(混合欧氏距离与大语言模型)的新型小样本优化方法,用于改进基于传感器的分类任务中的样本选择。尽管小样本提示技术能够在有限标注数据下实现高效推理,但其性能很大程度上取决于所选样本的质量。HED-LM通过混合选择流程应对这一挑战:首先基于欧氏距离筛选候选样本,随后利用大语言模型(LLMs)评分的上下文相关性进行重排序。为验证其有效性,我们将HED-LM应用于加速度计数据的疲劳检测任务,该任务具有模式重叠和主体间高变异性的特征。与活动识别等简单任务不同,由于生理信号的细微差异,疲劳检测需要更精细的样本选择。实验结果表明,HED-LM的平均宏观F1分数达到69.13±10.71%,优于随机选择(59.30±10.13%)和纯距离过滤(67.61±11.39%),分别实现了16.6%和2.3%的相对提升。这些结果证实了数值相似性与上下文相关性相结合能增强小样本提示的鲁棒性。总体而言,HED-LM为提升现实世界传感器学习任务性能提供了实用解决方案,在健康监测、人类活动识别和工业安全场景中展现出更广泛的应用潜力。
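
The two-stage selection described in this abstract (distance filter, then relevance re-ranking) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `n_candidates` and `k` parameters, and the `relevance` callable (which stands in for an LLM-produced relevance score) are all hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

# An example is a (feature vector, label) pair; this layout is an assumption.
Example = Tuple[Sequence[float], str]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_examples(
    query: Sequence[float],
    pool: List[Example],
    relevance: Callable[[Sequence[float], Example], float],
    n_candidates: int,
    k: int,
) -> List[Example]:
    """Pick k few-shot examples for `query` from `pool`."""
    # Stage 1: keep the n_candidates examples nearest to the query.
    candidates = sorted(pool, key=lambda ex: euclidean(query, ex[0]))[:n_candidates]
    # Stage 2: re-rank by relevance, highest first. `relevance` is a
    # hypothetical stand-in for an LLM-based scorer. Python's sort is stable,
    # so ties keep their distance order from stage 1.
    reranked = sorted(candidates, key=lambda ex: relevance(query, ex), reverse=True)
    return reranked[:k]
```

In practice the scorer would wrap a prompt to a language model; keeping it as a plain callable makes the pipeline testable without one.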


Mind the Gap: A Practical Attack on GGUF Quantization

Abstract

arXiv:2505.23786v1 Announce Type: cross Abstract: With the increasing size of frontier LLMs, post-training quantization has become the standard for memory-efficient deployment. Recent work has shown that basic rounding-based quantization schemes pose security risks, as they can be exploited to inject malicious behaviors into quantized models that remain hidden in full precision. However, existing attacks cannot be applied to more complex quantization methods, such as the GGUF family used in the popular ollama and llama.cpp frameworks. In this work, we address this gap by introducing the first attack on GGUF. Our key insight is that the quantization error -- the difference between the full-precision weights and their (de-)quantized version -- provides sufficient flexibility to construct malicious quantized models that appear benign in full precision. Leveraging this, we develop an attack that trains the target malicious LLM while constraining its weights based on quantization errors. We demonstrate the effectiveness of our attack on three popular LLMs across nine GGUF quantization data types on three diverse attack scenarios: insecure code generation (Δ = 88.7%), targeted content injection (Δ = 85.0%), and benign instruction refusal (Δ = 30.1%). Our attack highlights that (1) the most widely used post-training quantization method is susceptible to adversarial interferences, and (2) the complexity of quantization schemes alone is insufficient as a defense.

摘要

随着前沿大型语言模型(LLM)规模的不断扩大,训练后量化已成为内存高效部署的标准方案。最新研究表明,基于舍入的基本量化方案存在安全风险,攻击者可利用其在量化模型中植入恶意行为,而这些行为在全精度模型中仍保持隐蔽。然而,现有攻击方法无法应用于更复杂的量化方案,例如ollama和llama.cpp等流行框架采用的GGUF系列量化方法。本研究通过首次提出针对GGUF的攻击填补了这一空白。我们的核心发现是:量化误差(即全精度权重与其(反)量化版本之间的差异)为构建全精度下表现正常但包含恶意的量化模型提供了足够的操作空间。基于此,我们开发了一种在量化误差约束下训练目标恶意LLM的攻击方法。通过在三种流行LLM、九种GGUF量化数据类型上针对三种攻击场景(不安全代码生成攻击成功率提升Δ=88.7%、定向内容注入Δ=85.0%、良性指令拒绝Δ=30.1%)的验证,证明了该攻击的有效性。本研究揭示:(1)当前最广泛采用的训练后量化方法仍易受对抗性干扰;(2)仅依靠量化方案的复杂性不足以构成有效防御。


Meaning Is Not A Metric: Using LLMs to make cultural context legible at scale

Abstract

arXiv:2505.23785v1 Announce Type: cross Abstract: This position paper argues that large language models (LLMs) can make cultural context, and therefore human meaning, legible at an unprecedented scale in AI-based sociotechnical systems. We argue that such systems have previously been unable to represent human meaning because they rely on thin descriptions: numerical representations that enforce standardization and therefore strip human activity of the cultural context that gives it meaning. By contrast, scholars in the humanities and qualitative social sciences have developed frameworks for representing meaning through thick description: verbal representations that accommodate heterogeneity and retain contextual information needed to represent human meaning. While these methods can effectively codify meaning, they are difficult to deploy at scale. However, the verbal capabilities of LLMs now provide a means of (at least partially) automating the generation and processing of thick descriptions, potentially overcoming this bottleneck. We argue that the problem of rendering human meaning legible is not just about selecting better metrics, but about developing new representational formats (based on thick description). We frame this as a crucial direction for the application of generative AI and identify five key challenges: preserving context, maintaining interpretive pluralism, integrating perspectives based on lived experience and critical distance, distinguishing qualitative content from quantitative magnitude, and acknowledging meaning as dynamic rather than static. Furthermore, we suggest that thick description has the potential to serve as a unifying framework to address a number of emerging concerns about the difficulties of representing culture in (or using) LLMs.

摘要

本立场文件提出,大型语言模型(LLMs)能够在基于人工智能的社会技术系统中以前所未有的规模呈现文化语境,从而使人类意义变得可解读。我们认为,此类系统过去之所以无法表征人类意义,是因为其依赖于"薄描述"——这种数值化表征强制标准化,从而剥离了赋予人类活动意义的文化语境。相比之下,人文与定性社会科学学者已发展出通过"厚描述"表征意义的框架:这种语言表征方式能容纳异质性并保留呈现人类意义所需的语境信息。尽管这些方法能有效编码意义,却难以大规模应用。然而,LLMs的语言能力目前提供了(至少部分)自动化生成和处理厚描述的手段,有望突破这一瓶颈。我们指出,人类意义可读化问题不仅关乎选择更优度量标准,更在于开发基于厚描述的新型表征形式。本文将之定位为生成式AI应用的关键方向,并提出五大核心挑战:语境保存、阐释多元性维护、生活经验与批判距离视角的整合、质性内容与量化程度的区分,以及意义动态性的认知。此外,我们认为厚描述有望成为统一框架,用以解决LLMs中(或使用LLMs时)文化表征困难的若干新兴问题。


Boosting In-Context Learning in LLMs Through the Lens of Classical Supervised Learning

Abstract

arXiv:2505.23783v1 Announce Type: cross Abstract: In-Context Learning (ICL) allows Large Language Models (LLMs) to adapt to new tasks with just a few examples, but their predictions often suffer from systematic biases, leading to unstable performances in classification. While calibration techniques are proposed to mitigate these biases, we show that, in the logit space, many of these methods are equivalent to merely shifting the LLM's decision boundary without having the ability to alter its orientation. This proves inadequate when biases cause the LLM to be severely misdirected. To address these limitations and provide a unifying framework, we propose Supervised Calibration (SC), a loss-minimization based framework which learns an optimal, per-class affine transformation of the LLM's predictive probabilities in the logit space without requiring external data beyond the context. By using a more expressive functional class, SC not only subsumes many existing calibration methods in ICL as special cases, but also enables the ability to alter and even completely reverse the orientation of the LLM's decision boundary. Furthermore, SC's loss-based nature facilitates the seamless integration of two purpose-built regularization techniques: context-invariance and directional trust-region. The former is designed to tackle the instability issue in ICL, while the latter controls the degree of calibration. Finally, SC delivers state-of-the-art performance over calibration baselines in the 4-shot, 8-shot, and 16-shot settings across all nine datasets for Mistral-7B-Instruct-v0.3, LLaMA-2-7B-chat, and Qwen2-7B-Instruct.

摘要

上下文学习(ICL)使大型语言模型(LLM)仅需少量示例即可适应新任务,但其预测常受系统性偏差影响,导致分类性能不稳定。尽管已有校准技术被提出以缓解这些偏差,但我们证明在逻辑空间中,许多此类方法仅相当于平移LLM的决策边界,而无法改变其方向。当偏差导致LLM严重偏离时,这种校准显然不足。为突破这些限制并建立统一框架,我们提出监督校准(SC)——一种基于损失最小化的框架,可在无需上下文外数据的情况下,学习LLM预测概率在逻辑空间中每类的最优仿射变换。通过采用更具表达力的函数类,SC不仅将ICL中多种现有校准方法纳入为特例,还能调整甚至完全逆转LLM决策边界的方向。此外,SC的损失函数特性便于无缝集成两种定制正则化技术:上下文不变性和方向性信任区域。前者针对ICL的不稳定性问题设计,后者则控制校准程度。最终,在Mistral-7B-Instruct-v0.3、LLaMA-2-7B-chat和Qwen2-7B-Instruct模型上,SC于4样本、8样本和16样本设置下的全部九个数据集中均实现了超越基线校准方法的先进性能。
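
The distinction this abstract draws between shifting a decision boundary and changing its orientation can be illustrated in a binary setting. This is a deliberately simplified sketch: it works on a single log-odds value rather than the per-class transformation the abstract describes, and the function names and parameters `a` and `b` are assumptions made for illustration.

```python
from typing import List


def predict(logit: float) -> int:
    """Binary decision rule: predict class 1 when the log-odds are positive."""
    return 1 if logit > 0 else 0


def shift_calibrate(logit: float, b: float) -> float:
    """Bias-only calibration: moves the threshold, never flips the ordering."""
    return logit + b


def affine_calibrate(logit: float, a: float, b: float) -> float:
    """Affine calibration: a scale `a` plus a shift `b`.

    A negative `a` reverses which side of the boundary maps to class 1.
    """
    return a * logit + b


def predictions(logits: List[float], transform) -> List[int]:
    """Apply a calibration transform to each logit, then the decision rule."""
    return [predict(transform(z)) for z in logits]
```

For sorted inputs, any shift yields predictions that are still non-decreasing in the original logit, whereas `a = -1` flips them. That is the sense in which a shift alone cannot correct a model whose scores point the wrong way.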


Rethinking the Understanding Ability across LLMs through Mutual Information

Abstract

arXiv:2505.23790v1 Announce Type: cross Abstract: Recent advances in large language models (LLMs) have revolutionized natural language processing, yet evaluating their intrinsic linguistic understanding remains challenging. Moving beyond specialized evaluation tasks, we propose an information-theoretic framework grounded in mutual information (MI) to achieve this. We formalize the understanding as MI between an input sentence and its latent representation (sentence-level MI), measuring how effectively input information is preserved in latent representation. Given that LLMs learn embeddings for individual tokens, we decompose sentence-level MI into token-level MI between tokens and sentence embeddings, establishing theoretical bounds connecting these measures. Based on this foundation, we theoretically derive a computable lower bound for token-level MI using Fano's inequality, which directly relates to token-level recoverability-the ability to predict original tokens from sentence embedding. We implement this recoverability task to comparatively measure MI across different LLMs, revealing that encoder-only models consistently maintain higher information fidelity than their decoder-only counterparts, with the latter exhibiting a distinctive late-layer "forgetting" pattern where mutual information is first enhanced and then discarded. Moreover, fine-tuning to maximize token-level recoverability consistently improves understanding ability of LLMs on tasks without task-specific supervision, demonstrating that mutual information can serve as a foundation for understanding and improving language model capabilities.

摘要

大型语言模型(LLMs)的最新进展彻底改变了自然语言处理领域,但评估其内在语言理解能力仍具挑战性。本文突破传统专项评估任务的局限,提出基于互信息(MI)的信息理论框架来实现这一目标。我们将语言理解形式化为输入句子与其潜在表征之间的互信息(句子级MI),用以衡量输入信息在潜在表征中的保存效率。鉴于LLMs为单个词元学习嵌入表示,我们将句子级MI分解为词元与句子嵌入间的词元级MI,并建立连接这些度量的理论边界。在此基础上,通过法诺不等式理论推导出词元级MI的可计算下界,该下界直接关联词元级可恢复性——即从句子嵌入预测原始词元的能力。我们通过实现该可恢复性任务,对不同LLMs的MI进行对比测量,发现仅编码器模型始终比仅解码器模型保持更高的信息保真度,后者呈现出独特的深层"遗忘"模式:互信息先增强后被丢弃。此外,在无任务特定监督的情况下,通过微调最大化词元级可恢复性能够持续提升LLMs在各项任务中的理解能力,这证明互信息可作为理解和改进语言模型能力的基础。
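
For readers who want the link between recoverability and mutual information spelled out, the generic Fano argument runs as follows. The notation is introduced here and is not the paper's: $T$ is a token drawn from a vocabulary $\mathcal{V}$, $E$ is the sentence embedding, $\hat{T} = g(E)$ is any predictor of the token, and $P_e = \Pr(\hat{T} \neq T)$ is its error rate. The paper's exact bound may differ in form.

```latex
\begin{align}
  % Fano's inequality, with H_b the binary entropy function
  H(T \mid E) &\le H_b(P_e) + P_e \log\big(|\mathcal{V}| - 1\big) \\
  % Substituting into I(T;E) = H(T) - H(T|E)
  I(T; E) = H(T) - H(T \mid E)
          &\ge H(T) - H_b(P_e) - P_e \log\big(|\mathcal{V}| - 1\big)
\end{align}
```

For error rates below $1 - 1/|\mathcal{V}|$, the right-hand side grows as $P_e$ falls, so a lower token-recovery error certifies a higher lower bound on the mutual information.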


Estimating LLM Consistency: A User Baseline vs Surrogate Metrics

Abstract

arXiv:2505.23799v1 Announce Type: cross Abstract: Large language models (LLMs) are prone to hallucinations and sensitive to prompt perturbations, often resulting in inconsistent or unreliable generated text. Different methods have been proposed to mitigate such hallucinations and fragility -- one of them being measuring the consistency (the model's confidence in the response, or likelihood of generating a similar response when resampled) of LLM responses. In previous work, measuring consistency often relied on the probability of a response appearing within a pool of resampled responses, or internal states or logits of responses. However, it is not yet clear how well these approaches approximate how humans perceive the consistency of LLM responses. We performed a user study (n=2,976) and found current methods typically do not approximate users' perceptions of LLM consistency very well. We propose a logit-based ensemble method for estimating LLM consistency, and we show that this method matches the performance of the best-performing existing metric in estimating human ratings of LLM consistency. Our results suggest that methods of estimating LLM consistency without human evaluation are sufficiently imperfect that we suggest evaluation with human input be more broadly used.

摘要

大型语言模型(LLMs)容易产生幻觉且对提示扰动敏感,常导致生成文本不一致或不可靠。已有多种方法被提出以缓解此类幻觉和脆弱性——其中之一是通过测量LLM响应的一致性(模型对回答的置信度,或重新采样时生成相似回答的可能性)。在以往研究中,一致性测量通常依赖于回答在重新采样回答池中出现的概率,或回答的内部状态或逻辑值。然而,这些方法在多大程度上能近似人类对LLM响应一致性的感知尚不明确。我们开展了一项用户研究(n=2,976),发现现有方法通常难以很好地近似用户对LLM一致性的感知。我们提出了一种基于逻辑值的集成方法来估计LLM一致性,并证明该方法在评估人类对LLM一致性评分方面与现有最佳性能指标表现相当。研究结果表明,未经人类评估的LLM一致性估计方法存在明显缺陷,因此我们建议更广泛地采用结合人类输入的评估方式。
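
One of the surrogate metrics mentioned in this abstract, the probability of a response appearing within a pool of resampled responses, can be sketched as a simple agreement rate. This is an illustration only: the exact-match criterion after normalization and the function names are assumptions, and published metrics often use semantic rather than string matching.

```python
from typing import List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def resample_agreement(response: str, resampled: List[str]) -> float:
    """Fraction of resampled responses that match `response` after normalization."""
    if not resampled:
        raise ValueError("need at least one resampled response")
    target = normalize(response)
    matches = sum(1 for r in resampled if normalize(r) == target)
    return matches / len(resampled)
```

A value near 1.0 means the model almost always regenerates the same answer. The abstract's finding is that scores of this kind track human judgments of consistency only loosely.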


R3-RAG: Learning Step-by-Step Reasoning and Retrieval for LLMs via Reinforcement Learning

Abstract

arXiv:2505.23794v1 Announce Type: cross Abstract: Retrieval-Augmented Generation (RAG) integrates external knowledge with Large Language Models (LLMs) to enhance factual correctness and mitigate hallucination. However, dense retrievers often become the bottleneck of RAG systems due to their limited parameters compared to LLMs and their inability to perform step-by-step reasoning. While prompt-based iterative RAG attempts to address these limitations, it is constrained by human-designed workflows. To address these limitations, we propose R3-RAG, which uses Reinforcement learning to make the LLM learn how to Reason and Retrieve step by step, thus retrieving comprehensive external knowledge and leading to correct answers. R3-RAG is divided into two stages. We first use cold start to make the model learn the manner of iteratively interleaving reasoning and retrieval. Then we use reinforcement learning to further harness its ability to better explore the external retrieval environment. Specifically, we propose two rewards for R3-RAG: 1) answer correctness for outcome reward, which judges whether the trajectory leads to a correct answer; 2) relevance-based document verification for process reward, encouraging the model to retrieve documents that are relevant to the user question, through which we can let the model learn how to iteratively reason and retrieve relevant documents to get the correct answer. Experimental results show that R3-RAG significantly outperforms baselines and can transfer well to different retrievers. We release R3-RAG at https://github.com/Yuan-Li-FNLP/R3-RAG.

摘要

检索增强生成(RAG)通过整合外部知识与大语言模型(LLMs)来提升事实准确性并减少幻觉现象。然而,稠密检索器由于参数量远小于LLMs且无法执行逐步推理,往往成为RAG系统的性能瓶颈。虽然基于提示的迭代式RAG尝试解决这些限制,但其仍受限于人工设计的工作流程。为此,我们提出R3-RAG框架,利用强化学习使LLM学会逐步推理与检索,从而获取全面外部知识并得出正确答案。R3-RAG分为两个阶段:首先通过冷启动使模型掌握交替进行推理与检索的迭代模式,随后运用强化学习进一步强化其探索外部检索环境的能力。具体而言,我们设计双重奖励机制:1)作为结果奖励的答案正确性评估,判断轨迹是否导向正确答案;2)作为过程奖励的基于相关性的文档验证,激励模型检索与用户问题相关的文档,从而使其学会通过迭代推理检索相关文档以获得正确答案。实验表明R3-RAG显著优于基线模型,并能良好适配不同检索器。项目代码已发布于https://github.com/Yuan-Li-FNLP/R3-RAG。


USB: A Comprehensive and Unified Safety Evaluation Benchmark for Multimodal Large Language Models

Abstract

arXiv:2505.23793v1 Announce Type: cross Abstract: Despite their remarkable achievements and widespread adoption, Multimodal Large Language Models (MLLMs) have revealed significant security vulnerabilities, highlighting the urgent need for robust safety evaluation benchmarks. Existing MLLM safety benchmarks, however, fall short in data quality, coverage, and modal risk combinations, resulting in inflated and contradictory evaluation results, which hinders the discovery and governance of security concerns. Besides, we argue that vulnerabilities to harmful queries and oversensitivity to harmless ones should be considered simultaneously in MLLM safety evaluation, whereas these were previously considered separately. In this paper, to address these shortcomings, we introduce Unified Safety Benchmarks (USB), which is one of the most comprehensive evaluation benchmarks in MLLM safety. Our benchmark features high-quality queries, extensive risk categories, comprehensive modal combinations, and encompasses both vulnerability and oversensitivity evaluations. From the perspective of two key dimensions: risk categories and modality combinations, we demonstrate that the available benchmarks -- even the union of the vast majority of them -- are far from being truly comprehensive. To bridge this gap, we design a sophisticated data synthesis pipeline that generates extensive, high-quality complementary data addressing previously unexplored aspects. By combining open-source datasets with our synthetic data, our benchmark provides 4 distinct modality combinations for each of the 61 risk sub-categories, covering both English and Chinese across both vulnerability and oversensitivity dimensions.

摘要

尽管多模态大语言模型(MLLMs)取得了显著成就并得到广泛应用,但其暴露出的重大安全漏洞凸显了对鲁棒性安全评估基准的迫切需求。现有MLLM安全基准在数据质量、覆盖范围及模态风险组合方面存在不足,导致评估结果被夸大且相互矛盾,阻碍了安全问题的发现与治理。此外,我们认为有害查询的脆弱性与无害查询的过度敏感性应被同步纳入MLLM安全评估,而此前这两者被割裂考量。本文针对这些缺陷提出"统一安全基准"(USB),这是目前最全面的MLLM安全评估基准之一。该基准具有高质量查询、广泛风险类别、完整模态组合等特点,同时涵盖脆弱性与过度敏感性评估。从风险类别和模态组合两个关键维度分析,我们发现现有基准——即便整合绝大多数现有资源——仍远未达到真正全面。为弥补这一缺口,我们设计了复杂的数据合成流程,生成大量高质量补充数据以覆盖既往未探索的领域。通过整合开源数据集与合成数据,本基准为61个风险子类别分别提供4种模态组合方案,涵盖中英双语在脆弱性与过度敏感性双维度的评估。


Abstract

arXiv:2505.23788v1 Announce Type: cross Abstract: Large language models (LLMs) commonly risk copyright infringement by reproducing protected content verbatim or with insufficient transformative modifications, posing significant ethical, legal, and practical concerns. Current inference-time safeguards predominantly rely on restrictive refusal-based filters, often compromising the practical utility of these models. To address this, we collaborated closely with intellectual property experts to develop FUA-LLM (Fair Use Aligned Language Models), a legally-grounded framework explicitly designed to align LLM outputs with fair-use doctrine. Central to our method is FairUseDB, a carefully constructed dataset containing 18,000 expert-validated examples covering nine realistic infringement scenarios. Leveraging this dataset, we apply Direct Preference Optimization (DPO) to fine-tune open-source LLMs, encouraging them to produce legally compliant and practically useful alternatives rather than resorting to blunt refusal. Recognizing the shortcomings of traditional evaluation metrics, we propose new measures: Weighted Penalty Utility and Compliance Aware Harmonic Mean (CAH) to balance infringement risk against response utility. Extensive quantitative experiments coupled with expert evaluations confirm that FUA-LLM substantially reduces problematic outputs (up to 20%) compared to state-of-the-art approaches, while preserving real-world usability.

摘要

大型语言模型(LLMs)通常存在侵犯版权的风险,其会逐字复述受保护内容或进行变革性不足的修改,从而引发重大的伦理、法律和实际问题。当前推理阶段的防护措施主要依赖基于拒绝的限制性过滤器,这往往损害了模型的实际效用。为解决这一问题,我们与知识产权专家紧密合作,开发了FUA-LLM(合理使用对齐语言模型),这是一个基于法律基础的框架,专门设计用于使LLM输出符合合理使用原则。我们方法的核心是FairUseDB,这是一个精心构建的数据集,包含18,000个经专家验证的示例,涵盖九种现实侵权场景。利用该数据集,我们应用直接偏好优化(DPO)对开源LLMs进行微调,促使模型生成合法合规且具有实际效用的替代方案,而非简单粗暴地拒绝。针对传统评估指标的不足,我们提出了新指标:加权惩罚效用(Weighted Penalty Utility)和合规感知调和平均数(CAH),以平衡侵权风险与响应效用。大量定量实验结合专家评估证实,与现有最优方法相比,FUA-LLM显著减少了问题输出(高达20%),同时保持了实际可用性。
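
For reference, the standard Direct Preference Optimization objective that this abstract applies is shown below. Here $\pi_\theta$ is the model being fine-tuned, $\pi_{\mathrm{ref}}$ is the frozen reference model, $\beta$ is a temperature-like hyperparameter, and $(x, y_w, y_l)$ is a prompt with a preferred and a dispreferred response. How FairUseDB examples map onto $(x, y_w, y_l)$ is not specified in the abstract, so treating the compliant alternative as $y_w$ is only a natural assumption.

```latex
\begin{equation}
  \mathcal{L}_{\mathrm{DPO}}(\theta) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
  \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right) \right]
\end{equation}
```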


LLM-Driven E-Commerce Marketing Content Optimization: Balancing Creativity and Conversion

Abstract

arXiv:2505.23809v1 Announce Type: cross Abstract: As e-commerce competition intensifies, balancing creative content with conversion effectiveness becomes critical. Leveraging LLMs' language generation capabilities, we propose a framework that integrates prompt engineering, multi-objective fine-tuning, and post-processing to generate marketing copy that is both engaging and conversion-driven. Our fine-tuning method combines sentiment adjustment, diversity enhancement, and CTA embedding. Through offline evaluations and online A/B tests across categories, our approach achieves a 12.5% increase in CTR and an 8.3% increase in CVR while maintaining content novelty. This provides a practical solution for automated copy generation and suggests paths for future multimodal, real-time personalization.

摘要

随着电商竞争加剧,平衡内容创意与转化效果变得至关重要。本研究利用大语言模型的文本生成能力,提出一个融合提示工程、多目标微调与后处理的框架,以生成兼具吸引力与转化驱动的营销文案。我们的微调方法整合了情感调节、多样性增强及行动号召嵌入三项技术。通过跨品类的离线评估与在线A/B测试,该方法在保持内容新颖性的同时实现点击率提升12.5%、转化率提升8.3%。这为自动化文案生成提供了实用解决方案,并为未来多模态实时个性化发展指明了路径。


My Answer Is NOT 'Fair': Mitigating Social Bias in Vision-Language Models via Fair and Biased Residuals

Abstract

arXiv:2505.23798v1 Announce Type: cross Abstract: Social bias is a critical issue in large vision-language models (VLMs), where fairness- and ethics-related problems harm certain groups of people in society. It is unknown to what extent VLMs yield social bias in generative responses. In this study, we focus on evaluating and mitigating social bias on both the model's response and probability distribution. To do so, we first evaluate four state-of-the-art VLMs on PAIRS and SocialCounterfactuals datasets with the multiple-choice selection task. Surprisingly, we find that models suffer from generating gender-biased or race-biased responses. We also observe that models are prone to stating that their responses are fair while in fact having miscalibrated confidence levels towards particular social groups. While investigating why VLMs are unfair in this study, we observe that VLMs' hidden layers exhibit substantial fluctuations in fairness levels. Meanwhile, residuals in each layer show mixed effects on fairness, with some contributing positively while others lead to increased bias. Based on these findings, we propose a post-hoc method for the inference stage to mitigate social bias, which is training-free and model-agnostic. We achieve this by ablating bias-associated residuals while amplifying fairness-associated residuals on model hidden layers during inference. We demonstrate that our post-hoc method outperforms the competing training strategies, helping VLMs have fairer responses and more reliable confidence levels.

摘要

社会偏见是大型视觉-语言模型(VLMs)中的一个关键问题,与公平性和伦理相关的问题会损害社会中的特定群体。目前尚不清楚VLMs在生成响应中产生社会偏见的程度。在本研究中,我们重点评估并缓解模型响应和概率分布上的社会偏见。为此,我们首先在PAIRS和SocialCounterfactuals数据集上通过多项选择任务评估了四种最先进的VLMs。令人惊讶的是,我们发现这些模型容易生成带有性别或种族偏见的响应。我们还观察到,模型倾向于声称其响应是公平的,但实际上对特定社会群体的置信水平存在校准错误。在研究VLMs为何不公平的过程中,我们发现VLMs的隐藏层在公平性水平上表现出显著波动。同时,每一层的残差对公平性表现出混合效应,部分残差对公平性有积极贡献,而另一些则导致偏见增加。基于这些发现,我们提出了一种推理阶段的后处理方法以缓解社会偏见,该方法无需训练且与模型无关。我们通过在推理过程中消除与偏见相关的残差,同时放大隐藏层中与公平性相关的残差来实现这一目标。实验表明,我们的后处理方法优于其他训练策略,有助于VLMs生成更公平的响应和更可靠的置信水平。


Calibrating LLMs for Text-to-SQL Parsing by Leveraging Sub-clause Frequencies

Abstract

arXiv:2505.23804v1 Announce Type: cross Abstract: While large language models (LLMs) achieve strong performance on text-to-SQL parsing, they sometimes exhibit unexpected failures in which they are confidently incorrect. Building trustworthy text-to-SQL systems thus requires eliciting reliable uncertainty measures from the LLM. In this paper, we study the problem of providing a calibrated confidence score that conveys the likelihood of an output query being correct. Our work is the first to establish a benchmark for post-hoc calibration of LLM-based text-to-SQL parsing. In particular, we show that Platt scaling, a canonical method for calibration, provides substantial improvements over directly using raw model output probabilities as confidence scores. Furthermore, we propose a method for text-to-SQL calibration that leverages the structured nature of SQL queries to provide more granular signals of correctness, named "sub-clause frequency" (SCF) scores. Using multivariate Platt scaling (MPS), our extension of the canonical Platt scaling technique, we combine individual SCF scores into an overall accurate and calibrated score. Empirical evaluation on two popular text-to-SQL datasets shows that our approach of combining MPS and SCF yields further improvements in calibration and the related task of error detection over traditional Platt scaling.

摘要

尽管大语言模型(LLMs)在文本到SQL解析任务中表现出色,但它们有时会出现自信错误的意外故障。因此,构建可信的文本到SQL系统需要从LLM中获取可靠的不确定性度量。本文研究了如何提供校准后的置信分数,以反映输出查询正确的可能性。我们的工作首次为基于LLM的文本到SQL解析建立了事后校准基准。特别地,我们发现普拉特缩放(Platt scaling)这一经典校准方法,相较于直接使用原始模型输出概率作为置信分数,能带来显著改进。此外,我们提出了一种利用SQL查询结构化特性的校准方法,通过"子句频率"(SCF)分数提供更细粒度的正确性信号。采用多元普拉特缩放(MPS)——我们对经典普拉特缩放技术的扩展——将各个SCF分数组合成整体准确且校准的分数。在两个主流文本到SQL数据集上的实证评估表明,结合MPS与SCF的方法在校准及相关错误检测任务上,较传统普拉特缩放实现了进一步优化。
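
Platt scaling fits a logistic function that maps a raw score to a calibrated probability of correctness. With several input scores it becomes a small logistic regression, which is one plausible reading of the multivariate extension mentioned in this abstract. The sketch below uses plain gradient descent on the log loss. The function names, learning rate, and epoch count are assumptions, and the label smoothing used in Platt's original method is omitted.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_platt(
    features: List[Sequence[float]],
    labels: List[int],
    lr: float = 0.1,
    epochs: int = 2000,
) -> Tuple[List[float], float]:
    """Fit weights w and bias b so that sigmoid(w . x + b) approximates P(correct).

    Each row of `features` holds one or more raw scores for a query
    (a single column gives ordinary univariate Platt scaling).
    """
    n, d = len(features), len(features[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        # Batch gradient step on the mean log loss.
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def calibrated_confidence(x: Sequence[float], w: List[float], b: float) -> float:
    """Calibrated probability that the output with raw scores `x` is correct."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
```

The parameters should be fit on a held-out calibration set with known correctness labels, not on the data used to evaluate calibration.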


DLP: Dynamic Layerwise Pruning in Large Language Models

Abstract

arXiv:2505.23807v1 Announce Type: cross Abstract: Pruning has recently been widely adopted to reduce the parameter scale and improve the inference efficiency of Large Language Models (LLMs). Mainstream pruning techniques often rely on uniform layerwise pruning strategies, which can lead to severe performance degradation at high sparsity levels. Recognizing the varying contributions of different layers in LLMs, recent studies have shifted their focus toward non-uniform layerwise pruning. However, these approaches often rely on pre-defined values, which can result in suboptimal performance. To overcome these limitations, we propose a novel method called Dynamic Layerwise Pruning (DLP). This approach adaptively determines the relative importance of each layer by integrating model weights with input activation information, assigning pruning rates accordingly. Experimental results show that DLP effectively preserves model performance at high sparsity levels across multiple LLMs. Specifically, at 70% sparsity, DLP reduces the perplexity of LLaMA2-7B by 7.79 and improves the average accuracy by 2.7% compared to state-of-the-art methods. Moreover, DLP is compatible with various existing LLM compression techniques and can be seamlessly integrated into Parameter-Efficient Fine-Tuning (PEFT). We release the code at https://github.com/ironartisan/DLP to facilitate future research.

摘要

剪枝技术近年来被广泛采用以缩减大语言模型(LLM)的参数规模并提升推理效率。主流剪枝方法通常采用统一的逐层剪枝策略,这在较高稀疏度下会导致严重的性能下降。鉴于LLM中不同层的贡献度存在差异,近期研究开始转向非均匀的逐层剪枝方案。然而这些方法往往依赖预设值,可能导致次优性能。为突破这些限制,我们提出了一种称为动态逐层剪枝(DLP)的新方法。该方法通过整合模型权重与输入激活信息,自适应地确定各层相对重要性并分配剪枝率。实验结果表明,DLP在多种LLM的高稀疏度下均能有效保持模型性能。具体而言,在70%稀疏度时,相较于最先进方法,DLP将LLaMA2-7B的困惑度降低7.79,平均准确率提升2.7%。此外,DLP兼容多种现有LLM压缩技术,并可无缝集成至参数高效微调(PEFT)框架。我们在https://github.com/ironartisan/DLP发布了代码以促进后续研究。


MedOrchestra: A Hybrid Cloud-Local LLM Approach for Clinical Data Interpretation

Abstract

arXiv:2505.23806v1 Announce Type: cross Abstract: Deploying large language models (LLMs) in clinical settings faces critical trade-offs: cloud LLMs, with their extensive parameters and superior performance, pose risks to sensitive clinical data privacy, while local LLMs preserve privacy but often fail at complex clinical interpretation tasks. We propose MedOrchestra, a hybrid framework where a cloud LLM decomposes complex clinical tasks into manageable subtasks and prompt generation, while a local LLM executes these subtasks in a privacy-preserving manner. Without accessing clinical data, the cloud LLM generates and validates subtask prompts using clinical guidelines and synthetic test cases. The local LLM executes subtasks locally and synthesizes outputs generated by the cloud LLM. We evaluate MedOrchestra on pancreatic cancer staging using 100 radiology reports under NCCN guidelines. On free-text reports, MedOrchestra achieves 70.21% accuracy, outperforming local model baselines (without guideline: 48.94%, with guideline: 56.59%) and board-certified clinicians (gastroenterologists: 59.57%, surgeons: 65.96%, radiologists: 55.32%). On structured reports, MedOrchestra reaches 85.42% accuracy, showing clear superiority across all settings.

摘要

在临床环境中部署大型语言模型(LLMs)面临关键权衡:云端LLMs凭借其庞大参数量和卓越性能,却对敏感临床数据隐私构成风险;而本地LLMs虽能保护隐私,却常在复杂临床解读任务中表现不佳。我们提出MedOrchestra混合框架,通过云端LLM将复杂临床任务分解为可管理的子任务和提示生成,同时由本地LLM以隐私保护方式执行这些子任务。云端LLM无需接触临床数据,仅通过临床指南和合成测试案例生成并验证子任务提示。本地LLM在本地执行子任务,并整合云端LLM生成的输出。我们根据NCCN指南对100份胰腺癌分期放射学报告进行评估。在自由文本报告中,MedOrchestra达到70.21%准确率,优于本地模型基线(无指南:48.94%,有指南:56.59%)和具备委员会认证的临床医师(胃肠病学家:59.57%,外科医生:65.96%,放射科医师:55.32%)。在结构化报告中,MedOrchestra准确率达85.42%,在所有设置中均展现出明显优势。


LayerIF: Estimating Layer Quality for Large Language Models using Influence Functions

Abstract

arXiv:2505.23811v1 Announce Type: cross Abstract: Pretrained Large Language Models (LLMs) achieve strong performance across a wide range of tasks, yet exhibit substantial variability in the various layers' training quality with respect to specific downstream applications, limiting their downstream performance. It is therefore critical to estimate layer-wise training quality in a manner that accounts for both model architecture and training data. However, existing approaches predominantly rely on model-centric heuristics (such as spectral statistics, outlier detection, or uniform allocation) while overlooking the influence of data. To address these limitations, we propose LayerIF, a data-driven framework that leverages Influence Functions to quantify the training quality of individual layers in a principled and task-sensitive manner. By isolating each layer's gradients and measuring the sensitivity of the validation loss to training examples by computing layer-wise influences, we derive data-driven estimates of layer importance. Notably, our method produces task-specific layer importance estimates for the same LLM, revealing how layers specialize for different test-time evaluation tasks. We demonstrate the utility of our scores by leveraging them for two downstream applications: (a) expert allocation in LoRA-MoE architectures and (b) layer-wise sparsity distribution for LLM pruning. Experiments across multiple LLM architectures demonstrate that our model-agnostic, influence-guided allocation leads to consistent gains in task performance.

摘要

预训练大语言模型(LLMs)在广泛任务中表现出色,但其各层针对特定下游应用的训练质量存在显著差异,这限制了模型的下游性能。因此,关键需要一种能兼顾模型架构与训练数据的逐层训练质量评估方法。现有方法主要依赖模型中心启发式策略(如频谱统计、异常值检测或均匀分配),却忽视了数据的影响。为此,我们提出LayerIF框架——通过影响函数以原则化且任务敏感的方式量化单层训练质量。该方法通过隔离各层梯度并计算逐层影响值来度量验证损失对训练样本的敏感性,从而获得数据驱动的层重要性估计。值得注意的是,本方法能为同一LLM生成任务特定的层重要性评估,揭示不同测试任务中各层的专业化特性。我们通过两个下游应用验证了该评分的实用性:(a)LoRA-MoE架构中的专家分配;(b)LLM剪枝的逐层稀疏度分配。跨多种LLM架构的实验表明,这种与模型无关的影响引导分配策略能持续提升任务性能。
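
For context, the classical influence function estimates how upweighting a training example $z$ changes the loss on a validation example $z_{\mathrm{val}}$. The first line below is that standard definition, with $\hat{\theta}$ the trained parameters and $H_{\hat{\theta}}$ the Hessian of the training loss. The second line is only an assumed layer-restricted variant in which gradients and Hessian are taken with respect to the parameters $\theta_\ell$ of layer $\ell$. The abstract says each layer's gradients are isolated but does not give the formula.

```latex
\begin{align}
  % Classical influence of training point z on the loss at z_val
  \mathcal{I}(z, z_{\mathrm{val}}) &=
  -\,\nabla_\theta L(z_{\mathrm{val}}, \hat{\theta})^{\top}
  H_{\hat{\theta}}^{-1}\,
  \nabla_\theta L(z, \hat{\theta}) \\
  % Assumed layer-wise restriction to the parameters of layer \ell
  \mathcal{I}_\ell(z, z_{\mathrm{val}}) &=
  -\,\nabla_{\theta_\ell} L(z_{\mathrm{val}}, \hat{\theta})^{\top}
  H_{\hat{\theta}_\ell}^{-1}\,
  \nabla_{\theta_\ell} L(z, \hat{\theta})
\end{align}
```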


MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks

Abstract

arXiv:2505.23802v1 Announce Type: cross Abstract: While large language models (LLMs) achieve near-perfect scores on medical licensing exams, these evaluations inadequately reflect the complexity and diversity of real-world clinical practice. We introduce MedHELM, an extensible evaluation framework for assessing LLM performance for medical tasks with three key contributions. First, a clinician-validated taxonomy spanning 5 categories, 22 subcategories, and 121 tasks developed with 29 clinicians. Second, a comprehensive benchmark suite comprising 35 benchmarks (17 existing, 18 newly formulated) providing complete coverage of all categories and subcategories in the taxonomy. Third, a systematic comparison of LLMs with improved evaluation methods (using an LLM-jury) and a cost-performance analysis. Evaluation of 9 frontier LLMs, using the 35 benchmarks, revealed significant performance variation. Advanced reasoning models (DeepSeek R1: 66% win-rate; o3-mini: 64% win-rate) demonstrated superior performance, though Claude 3.5 Sonnet achieved comparable results at 40% lower estimated computational cost. On a normalized accuracy scale (0-1), most models performed strongly in Clinical Note Generation (0.73-0.85) and Patient Communication & Education (0.78-0.83), moderately in Medical Research Assistance (0.65-0.75), and generally lower in Clinical Decision Support (0.56-0.72) and Administration & Workflow (0.53-0.63). Our LLM-jury evaluation method achieved good agreement with clinician ratings (ICC = 0.47), surpassing both average clinician-clinician agreement (ICC = 0.43) and automated baselines including ROUGE-L (0.36) and BERTScore-F1 (0.44). Claude 3.5 Sonnet achieved comparable performance to top models at lower estimated cost. These findings highlight the importance of real-world, task-specific evaluation for medical use of LLMs and provide an open-source framework to enable this.

摘要

虽然大型语言模型(LLMs)在医疗执照考试中取得接近满分的成绩,但这些评估未能充分反映真实临床实践的复杂性与多样性。我们提出MedHELM——一个可扩展的评估框架,用于衡量LLMs在医疗任务中的表现,其贡献包含三方面:首先,通过与29位临床医生合作开发了包含5大类、22个子类及121项任务的临床验证分类体系;其次,构建了涵盖该分类体系所有类别的35个基准测试组合(17个现有基准+18个新设计基准);第三,采用改进的评估方法(LLM陪审团机制)进行系统模型对比及成本-性能分析。对9个前沿LLMs的评估显示:高级推理模型(DeepSeek R1胜率66%;o3-mini胜率64%)表现最优,但Claude 3.5 Sonnet在估算计算成本降低40%的情况下达到相近水平。在标准化准确度评分(0-1)中,多数模型在临床记录生成(0.73-0.85)和患者沟通教育(0.78-0.83)表现优异,在医学研究辅助(0.65-0.75)中等,在临床决策支持(0.56-0.72)和行政流程(0.53-0.63)表现较弱。LLM陪审团评估法与临床医生评分具有良好一致性(ICC=0.47),优于临床医生间平均一致性(ICC=0.43)及ROUGE-L(0.36)、BERTScore-F1(0.44)等自动化基线。Claude 3.5 Sonnet以更低估算成本达到顶级模型水平。本研究揭示了针对医疗场景的任务特异性评估的重要性,并提供了实现该目标的开源框架。


DenseLoRA: Dense Low-Rank Adaptation of Large Language Models

Abstract

arXiv:2505.23808v1 Announce Type: cross Abstract: Low-rank adaptation (LoRA) has been developed as an efficient approach for adapting large language models (LLMs) by fine-tuning two low-rank matrices, thereby reducing the number of trainable parameters. However, prior research indicates that many of the weights in these matrices are redundant, leading to inefficiencies in parameter utilization. To address this limitation, we introduce Dense Low-Rank Adaptation (DenseLoRA), a novel approach that enhances parameter efficiency while achieving superior performance compared to LoRA. DenseLoRA builds upon the concept of representation fine-tuning, incorporating a single Encoder-Decoder to refine and compress hidden representations across all adaptation layers before applying adaptation. Instead of relying on two redundant low-rank matrices as in LoRA, DenseLoRA adapts LLMs through a dense low-rank matrix, improving parameter utilization and adaptation efficiency. We evaluate DenseLoRA on various benchmarks, showing that it achieves 83.8% accuracy with only 0.01% of trainable parameters, compared to LoRA's 80.8% accuracy with 0.70% of trainable parameters on LLaMA3-8B. Additionally, we conduct extensive experiments to systematically assess the impact of DenseLoRA's components on overall model performance. Code is available at https://github.com/mulin-ahu/DenseLoRA.

摘要

低秩自适应(LoRA)作为一种高效适应大语言模型(LLM)的方法被提出,其通过微调两个低秩矩阵来减少可训练参数数量。然而,先前研究表明这些矩阵中的许多权重是冗余的,导致参数利用效率低下。为解决这一局限,我们提出稠密低秩自适应(DenseLoRA),该方法在提升参数效率的同时,性能优于LoRA。DenseLoRA基于表示微调的概念,引入单个编码器-解码器结构,在实施自适应前对所有适配层的隐藏表示进行精炼与压缩。与LoRA依赖两个冗余低秩矩阵不同,DenseLoRA通过稠密低秩矩阵实现LLM适配,从而提升参数利用率和自适应效率。我们在多个基准测试上评估DenseLoRA,结果表明:在LLaMA3-8B模型上,DenseLoRA仅需0.01%可训练参数即可达到83.8%准确率,而LoRA使用0.70%参数仅获得80.8%准确率。此外,我们通过大量实验系统评估了DenseLoRA各组件对模型整体性能的影响。代码发布于https://github.com/mulin-ahu/DenseLoRA。
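
To make the parameter-efficiency argument in the abstract above concrete, here is a small arithmetic sketch in Python comparing trainable-parameter counts. The LoRA count follows the standard two-matrix formulation. The second function is only an assumed reading of the abstract (one shared encoder and decoder plus a small dense `rank x rank` matrix per adapted layer); the shapes, the function names, and the example dimensions are hypothetical and are not taken from the paper.

```python
def lora_params(d_in, d_out, rank, n_layers):
    """Standard LoRA: each adapted weight gets A (rank x d_in) and B (d_out x rank)."""
    return n_layers * rank * (d_in + d_out)


def shared_dense_params(d_model, rank, n_layers):
    """Assumed layout: one shared encoder/decoder plus a dense rank x rank matrix per layer.

    This layout is an assumption made for illustration, not the paper's specification.
    """
    shared = 2 * d_model * rank      # encoder (d_model x rank) + decoder (rank x d_model)
    per_layer = rank * rank          # dense low-rank core for each adapted layer
    return shared + n_layers * per_layer


if __name__ == "__main__":
    # Hypothetical dimensions chosen only to show the scaling behaviour.
    d_model, rank, n_layers = 4096, 16, 32
    print("LoRA:", lora_params(d_model, d_model, rank, n_layers))
    print("Shared encoder/decoder + dense cores:", shared_dense_params(d_model, rank, n_layers))
```

Under these assumptions the per-layer cost no longer scales with the hidden size, which is one plausible way a method could reach a far smaller trainable-parameter share than LoRA.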


MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation

Abstract

arXiv:2505.23810v1 Announce Type: cross Abstract: Large Language Models (LLMs), e.g. ChatGPT, have been widely adopted in real-world dialogue applications. However, LLMs' robustness, especially in handling long, complex dialogue sessions that involve frequent motivation transfer and sophisticated cross-turn dependency, has long been criticized. Nevertheless, no existing benchmark fully reflects these weaknesses. We present MARS-Bench, a Multi-turn Athletic Real-world Scenario Dialogue Benchmark, designed to close this gap. MARS-Bench is constructed from play-by-play text commentary so as to feature realistic dialogues specifically designed to evaluate three critical aspects of multi-turn conversations: Ultra Multi-turn, Interactive Multi-turn, and Cross-turn Tasks. Extensive experiments on MARS-Bench reveal that closed-source LLMs significantly outperform open-source alternatives, that explicit reasoning significantly boosts LLMs' robustness in handling long, complex dialogue sessions, and that LLMs indeed face significant challenges when handling motivation transfer and sophisticated cross-turn dependency. Moreover, based on an attention visualization experiment in Qwen2.5-7B-Instruction, we provide a mechanistic interpretation of how attention sinks due to special tokens lead to LLMs' performance degradation when handling long, complex dialogue sessions.

摘要

大型语言模型(LLMs),如ChatGPT,已在现实对话应用中广泛采用。然而,LLMs的鲁棒性,尤其是在处理包含频繁动机转移、复杂跨轮次依赖的长复杂对话会话时,一直备受质疑。尽管如此,现有基准测试均无法全面反映这些缺陷。为此,我们提出MARS-Bench,即多轮竞技现实场景对话基准,旨在弥补这一空白。该基准基于实况文字解说构建,其特色在于通过专门设计的现实对话评估多轮会话的三大关键维度:超多轮、交互式多轮及跨轮次任务。在MARS-Bench上的大量实验表明:闭源LLMs显著优于开源模型,显式推理能大幅提升LLMs处理长复杂对话的鲁棒性,且LLMs在应对动机转移和复杂跨轮次依赖时确实面临重大挑战。此外,基于Qwen2.5-7B-Instruction的注意力可视化实验,我们从机制可解释性角度揭示了特殊令牌导致的注意力分散如何造成LLMs处理长复杂对话时的性能下降。


MultiPhishGuard: An LLM-based Multi-Agent System for Phishing Email Detection

Abstract

arXiv:2505.23803v1 Announce Type: cross Abstract: Phishing email detection faces critical challenges from evolving adversarial tactics and heterogeneous attack patterns. Traditional detection methods, such as rule-based filters and denylists, often struggle to keep pace with these evolving tactics, leading to false negatives and compromised security. While machine learning approaches have improved detection accuracy, they still face challenges adapting to novel phishing strategies. We present MultiPhishGuard, a dynamic LLM-based multi-agent detection system that synergizes specialized expertise with adversarial-aware reinforcement learning. Our framework employs five cooperative agents (text, URL, metadata, explanation simplifier, and adversarial agents) with automatically adjusted decision weights powered by a Proximal Policy Optimization reinforcement learning algorithm. To address emerging threats, we introduce an adversarial training loop featuring an adversarial agent that generates subtle context-aware email variants, creating a self-improving defense ecosystem and enhancing system robustness. Experimental evaluations on public datasets demonstrate that MultiPhishGuard significantly outperforms Chain-of-Thoughts, single-agent baselines and state-of-the-art detectors, as validated by ablation studies and comparative analyses. Experiments demonstrate that MultiPhishGuard achieves high accuracy (97.89%) with low false positive (2.73%) and false negative rates (0.20%). Additionally, we incorporate an explanation simplifier agent, which provides users with clear and easily understandable explanations for why an email is classified as phishing or legitimate. This work advances phishing defense through dynamic multi-agent collaboration and generative adversarial resilience.

摘要

钓鱼邮件检测面临着对抗策略不断演变和攻击模式多样化的关键挑战。传统检测方法(如基于规则的过滤器和黑名单)往往难以跟上这些动态变化的策略,导致漏报率上升和安全防护失效。尽管机器学习方法提升了检测准确率,但仍难以适应新型钓鱼策略。本文提出MultiPhishGuard——一个基于大语言模型的动态多智能体检测系统,通过融合领域专业知识和对抗感知的强化学习实现协同防御。该框架部署了五个协作智能体(文本、URL、元数据、解释简化和对抗智能体),其决策权重由近端策略优化强化学习算法自动调整。为应对新兴威胁,我们引入了包含对抗训练循环的对抗智能体,该智能体能生成细微的上下文感知邮件变体,从而构建自我优化的防御生态系统并增强系统鲁棒性。在公开数据集上的实验评估表明,MultiPhishGuard显著优于思维链方法、单智能体基线及当前最优检测器,消融研究和对比分析验证了这一结论。实验显示该系统实现了高准确率(97.89%)、低误报率(2.73%)和漏报率(0.20%)。此外,我们整合了解释简化智能体,可为用户提供关于邮件被判定为钓鱼或合法原因的清晰易懂的解释。本研究通过动态多智能体协作和生成式对抗韧性推动了钓鱼防御技术的进步。
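
The weighted multi-agent decision and the reported error rates in the abstract above can be sketched as follows. This Python example assumes that only the text, URL, and metadata agents emit phishing probabilities and that their weights are fixed numbers; in the paper the weights are adjusted by a PPO policy, which is not implemented here. The function names and all numeric values are hypothetical.

```python
def fuse_verdicts(agent_probs, weights, threshold=0.5):
    """Combine per-agent phishing probabilities with a normalised weighted average."""
    total = sum(weights[name] for name in agent_probs)
    score = sum(weights[name] * prob for name, prob in agent_probs.items()) / total
    return score, score >= threshold


def error_rates(y_true, y_pred):
    """Accuracy, false positive rate, and false negative rate (1 = phishing)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "fpr": fp / (fp + tn) if (fp + tn) else 0.0,
        "fnr": fn / (fn + tp) if (fn + tp) else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical agent outputs and weights; a learned policy would supply the weights.
    probs = {"text": 0.9, "url": 0.2, "metadata": 0.4}
    weights = {"text": 0.5, "url": 0.3, "metadata": 0.2}
    print(fuse_verdicts(probs, weights))
    print(error_rates([1, 1, 0, 0], [1, 0, 0, 1]))
```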


Seven Security Challenges That Must be Solved in Cross-domain Multi-agent LLM Systems

Abstract

arXiv:2505.23847v1 Announce Type: cross Abstract: Large language models (LLMs) are rapidly evolving into autonomous agents that cooperate across organizational boundaries, enabling joint disaster response, supply-chain optimization, and other tasks that demand decentralized expertise without surrendering data ownership. Yet, cross-domain collaboration shatters the unified trust assumptions behind current alignment and containment techniques. An agent benign in isolation may, when receiving messages from an untrusted peer, leak secrets or violate policy, producing risks driven by emergent multi-agent dynamics rather than classical software bugs. This position paper maps the security agenda for cross-domain multi-agent LLM systems. We introduce seven categories of novel security challenges, for each of which we also present plausible attacks, security evaluation metrics, and future research guidelines.

摘要

大语言模型(LLMs)正迅速演化为跨组织边界协作的自主智能体,能够实现联合灾害响应、供应链优化等需要分散化专业知识而不放弃数据所有权的任务。然而,跨领域协作打破了当前对齐与遏制技术背后的统一信任假设。一个孤立状态下良性的智能体,在接收来自不可信伙伴的消息时,可能泄露机密或违反策略,由此产生的风险源自新兴的多智能体动态机制,而非传统软件缺陷。本立场文件系统梳理了跨领域多智能体LLM系统的安全议程,提出七类新型安全挑战,并为每类挑战提供可能的攻击案例、安全评估指标及未来研究方向建议。


Large Language Model-Based Agents for Automated Research Reproducibility: An Exploratory Study in Alzheimer's Disease

Abstract

arXiv:2505.23852v1 Announce Type: cross Abstract: Objective: To demonstrate the capabilities of Large Language Models (LLMs) as autonomous agents to reproduce findings of published research studies using the same or similar dataset. Materials and Methods: We used the "Quick Access" dataset of the National Alzheimer's Coordinating Center (NACC). We identified highly cited published research manuscripts using NACC data and selected five studies that appeared reproducible using this dataset alone. Using GPT-4o, we created a simulated research team of LLM-based autonomous agents tasked with writing and executing code to dynamically reproduce the findings of each study, given only study Abstracts, Methods sections, and data dictionary descriptions of the dataset. Results: We extracted 35 key findings described in the Abstracts across 5 Alzheimer's studies. On average, LLM agents approximately reproduced 53.2% of findings per study. Numeric values and range-based findings often differed between studies and agents. The agents also applied statistical methods or parameters that varied from the originals, though overall trends and significance were sometimes similar. Discussion: In some cases, LLM-based agents replicated research techniques and findings. In others, they failed due to implementation flaws or missing methodological detail. These discrepancies show the current limits of LLMs in fully automating reproducibility assessments. Still, this early investigation highlights the potential of structured agent-based systems to provide scalable evaluation of scientific rigor. Conclusion: This exploratory work illustrates both the promise and limitations of LLMs as autonomous agents for automating reproducibility in biomedical research.

摘要

目的:验证大型语言模型(LLMs)作为自主代理,利用相同或相似数据集复现已发表研究成果的能力。


Large Language Models Often Know When They Are Being Evaluated

Abstract

arXiv:2505.23836v1 Announce Type: cross Abstract: If AI models can detect when they are being evaluated, the effectiveness of evaluations might be compromised. For example, models could have systematically different behavior during evaluations, leading to less reliable benchmarks for deployment and governance decisions. We investigate whether frontier language models can accurately classify transcripts based on whether they originate from evaluations or real-world deployment, a capability we call evaluation awareness. To achieve this, we construct a diverse benchmark of 1,000 prompts and transcripts from 61 distinct datasets. These span public benchmarks (e.g., MMLU, SWEBench), real-world deployment interactions, and agent trajectories from scaffolding frameworks (e.g., web-browsing agents). Frontier models clearly demonstrate above-random evaluation awareness (Gemini-2.5-Pro reaches an AUC of 0.83), but do not yet surpass our simple human baseline (AUC of 0.92). Furthermore, both AI models and humans are better at identifying evaluations in agentic settings compared to chat settings. Additionally, we test whether models can identify the purpose of the evaluation. Under multiple-choice and open-ended questioning, AI models far outperform random chance in identifying what an evaluation is testing for. Our results indicate that frontier models already exhibit a substantial, though not yet superhuman, level of evaluation-awareness. We recommend tracking this capability in future models.

摘要

如果AI模型能够检测到自身正在被评估,那么评估的有效性可能会受到影响。例如,在评估过程中模型可能表现出系统性不同的行为,从而导致面向部署和治理决策的基准测试可靠性降低。本研究探讨前沿语言模型是否能准确区分文本内容来源于评估场景还是真实世界部署——我们将这种能力称为"评估意识"。为此,我们构建了一个包含61个不同数据集的1000条提示文本与对话记录的多样化基准,涵盖公共基准测试(如MMLU、SWEBench)、真实世界部署交互以及脚手架框架生成的智能体轨迹(如网页浏览代理)。研究显示,前沿模型表现出显著高于随机水平的评估意识(Gemini-2.5-Pro的AUC值达0.83),但尚未超越我们设定的人类简单基线(AUC为0.92)。此外,无论是AI模型还是人类,在智能体场景中识别评估的能力都优于聊天场景。我们进一步测试了模型识别评估目的的能力:在多选和开放式提问设置下,AI模型在判断评估测试目标方面的表现远超随机水平。结果表明,前沿模型已展现出显著(虽未达超人水平)的评估意识。我们建议在未来模型研发中持续追踪该能力的发展。
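
The AUC figures quoted in the abstract above summarise how well a score separates evaluation transcripts from deployment transcripts. The following minimal Python sketch computes AUC with its pairwise-ranking definition; the labels and scores are made-up toy values, and nothing here reproduces the paper's data or pipeline.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Toy data: 1 = transcript came from an evaluation, 0 = from deployment.
    labels = [1, 1, 0, 0]
    scores = [0.9, 0.4, 0.6, 0.1]  # hypothetical model-assigned probabilities
    print(roc_auc(labels, scores))
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation, which is the scale on which the 0.83 and 0.92 figures should be read.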


OMNIGUARD: An Efficient Approach for AI Safety Moderation Across Modalities

Abstract

arXiv:2505.23856v1 Announce Type: cross Abstract: The emerging capabilities of large language models (LLMs) have sparked concerns about their immediate potential for harmful misuse. The core approach to mitigate these concerns is the detection of harmful queries to the model. Current detection approaches are fallible, and are particularly susceptible to attacks that exploit mismatched generalization of model capabilities (e.g., prompts in low-resource languages or prompts provided in non-text modalities such as image and audio). To tackle this challenge, we propose OMNIGUARD, an approach for detecting harmful prompts across languages and modalities. Our approach (i) identifies internal representations of an LLM/MLLM that are aligned across languages or modalities and then (ii) uses them to build a language-agnostic or modality-agnostic classifier for detecting harmful prompts. OMNIGUARD improves harmful prompt classification accuracy by 11.57% over the strongest baseline in a multilingual setting, by 20.44% for image-based prompts, and sets a new SOTA for audio-based prompts. By repurposing embeddings computed during generation, OMNIGUARD is also very efficient (approximately 120× faster than the next fastest baseline). Code and data are available at: https://github.com/vsahil/OmniGuard.

摘要

大语言模型(LLM)新兴能力的涌现引发了人们对其潜在恶意滥用的担忧。缓解这一问题的核心方法是检测针对模型的有害查询。现有检测方法存在缺陷,尤其容易受到利用模型能力不匹配泛化的攻击(例如使用低资源语言输入的提示,或以图像、音频等非文本模态提供的提示)。为应对这一挑战,我们提出OMNIGUARD方法,用于跨语言和跨模态的有害提示检测。该方法(i)识别LLM/MLLM中跨语言或跨模态对齐的内部表征,(ii)利用这些表征构建语言无关或模态无关的分类器来检测有害提示。OMNIGUARD在多语言环境下将有害提示分类准确率较最强基线提升11.57%,在图像提示场景提升20.44%,并为音频提示检测树立了新标杆。通过复用生成过程中计算的嵌入向量,该方法还具备极高效率(比次快基线快约120倍)。代码与数据详见:https://github.com/vsahil/OmniGuard。


Revisiting Uncertainty Estimation and Calibration of Large Language Models

Abstract

arXiv:2505.23854v1 Announce Type: cross Abstract: As large language models (LLMs) are increasingly deployed in high-stakes applications, robust uncertainty estimation is essential for ensuring the safe and trustworthy deployment of LLMs. We present the most comprehensive study to date of uncertainty estimation in LLMs, evaluating 80 models spanning open- and closed-source families, dense and Mixture-of-Experts (MoE) architectures, reasoning and non-reasoning modes, quantization variants and parameter scales from 0.6B to 671B. Focusing on three representative black-box single-pass methods, including token probability-based uncertainty (TPU), numerical verbal uncertainty (NVU), and linguistic verbal uncertainty (LVU), we systematically evaluate uncertainty calibration and selective classification using the challenging MMLU-Pro benchmark, which covers both reasoning-intensive and knowledge-based tasks. Our results show that LVU consistently outperforms TPU and NVU, offering stronger calibration and discrimination while being more interpretable. We also find that high accuracy does not imply reliable uncertainty, and that model scale, post-training, reasoning ability and quantization all influence estimation performance. Notably, LLMs exhibit better uncertainty estimates on reasoning tasks than on knowledge-heavy ones, and good calibration does not necessarily translate to effective error ranking. These findings highlight the need for multi-perspective evaluation and position LVU as a practical tool for improving the reliability of LLMs in real-world settings.

摘要

随着大语言模型(LLMs)在高风险应用中的日益普及,稳健的不确定性估计对于确保其安全可信部署至关重要。本研究针对LLMs不确定性估计开展了迄今最全面的评估,涵盖80个模型,涉及开源与闭源体系、稠密与专家混合(MoE)架构、推理与非推理模式、量化变体以及0.6B至671B参数规模。通过聚焦三种代表性黑盒单次评估方法——基于标记概率的不确定性(TPU)、数值语言不确定性(NVU)和语言学语言不确定性(LVU),我们采用具有挑战性的MMLU-Pro基准系统评估了不确定性校准和选择性分类性能,该基准覆盖推理密集型和知识密集型任务。研究结果表明:LVU在校准性和判别力方面持续优于TPU与NVU,同时具备更强的可解释性;高准确率并不意味着可靠的不确定性,模型规模、训练后优化、推理能力和量化处理均会影响估计性能。值得注意的是,LLMs在推理任务上表现出比知识密集型任务更优的不确定性估计,且良好的校准性并不直接转化为有效的错误排序能力。这些发现凸显了多视角评估的必要性,并将LVU确立为提升实际应用中LLMs可靠性的实用工具。
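
Calibration, as discussed in the abstract above, is commonly measured with the expected calibration error (ECE): predictions are binned by stated confidence and the gap between average confidence and accuracy is averaged across bins. The Python sketch below is a generic ECE implementation on toy data; the abstract does not say which calibration metric or binning the study uses, so the choice of ECE and of ten equal-width bins is an assumption.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |mean confidence - accuracy| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 is placed in the last bin rather than out of range.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece


if __name__ == "__main__":
    # Toy example: the model claims 90% confidence but is right 3 times out of 4.
    confidences = [0.9, 0.9, 0.9, 0.9]
    correct = [True, True, True, False]
    print(expected_calibration_error(confidences, correct))
```

A low ECE does not guarantee that the most uncertain answers are the wrong ones, which matches the abstract's remark that good calibration need not translate into effective error ranking.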


MaCP: Minimal yet Mighty Adaptation via Hierarchical Cosine Projection

Abstract

arXiv:2505.23870v1 Announce Type: cross Abstract: We present a new adaptation method MaCP, Minimal yet Mighty adaptive Cosine Projection, that achieves exceptional performance while requiring minimal parameters and memory for fine-tuning large foundation models. Its general idea is to exploit the superior energy compaction and decorrelation properties of cosine projection to improve both model efficiency and accuracy. Specifically, it projects the weight change from the low-rank adaptation into the discrete cosine space. Then, the weight change is partitioned over different levels of the discrete cosine spectrum, and each partition's most critical frequency components are selected. Extensive experiments demonstrate the effectiveness of MaCP across a wide range of single-modality tasks, including natural language understanding, natural language generation, text summarization, as well as multi-modality tasks such as image classification and video understanding. MaCP consistently delivers superior accuracy, significantly reduced computational complexity, and lower memory requirements compared to existing alternatives.

摘要

我们提出了一种新型自适应方法MaCP(最小但强大的自适应余弦投影),该方法在仅需极少参数和内存的情况下,即可实现对大型基础模型的高效微调,同时获得卓越性能。其核心思想是利用余弦投影优异的能量压缩与去相关特性,同步提升模型效率和精度。具体而言,该方法将低秩自适应中的权重变化投影至离散余弦空间,随后将权重变化按离散余弦频谱的不同层级进行划分,并筛选各分区中最关键频率成分。大量实验证明,MaCP在自然语言理解、自然语言生成、文本摘要等单模态任务,以及图像分类、视频理解等多模态任务中均表现出卓越效果。相较于现有方法,MaCP始终展现出更高的准确度、显著降低的计算复杂度及更少的内存需求。
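
The projection-and-selection steps in the abstract above can be sketched with a small pure-Python discrete cosine transform. The example projects a toy weight-change matrix into the DCT domain, splits the spectrum into bands, and keeps only the largest coefficients in each band. The banding rule (by `u + v` frequency index), the function names, and the values of `n_bands` and `k_per_band` are assumptions made for illustration; the abstract does not give MaCP's actual partitioning or selection criteria.

```python
import math


def dct_1d(x):
    """Orthonormal DCT-II of a list of floats."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def idct_1d(c):
    """Inverse of dct_1d (orthonormal DCT-III)."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(
            c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(1, n)
        )
        out.append(s)
    return out


def transpose(m):
    return [list(col) for col in zip(*m)]


def dct_2d(m):
    """Separable 2D DCT: transform the rows, then the columns."""
    rows = [dct_1d(r) for r in m]
    return transpose([dct_1d(c) for c in transpose(rows)])


def idct_2d(m):
    rows = [idct_1d(r) for r in m]
    return transpose([idct_1d(c) for c in transpose(rows)])


def select_per_band(coeffs, n_bands, k_per_band):
    """Keep the k largest-magnitude coefficients in each frequency band.

    Bands are formed from the index sum u + v (a hypothetical rule).
    """
    n_rows, n_cols = len(coeffs), len(coeffs[0])
    max_freq = n_rows + n_cols - 2
    bands = [[] for _ in range(n_bands)]
    for u in range(n_rows):
        for v in range(n_cols):
            band = (u + v) * n_bands // (max_freq + 1)
            bands[band].append((abs(coeffs[u][v]), u, v))
    kept = [[0.0] * n_cols for _ in range(n_rows)]
    for entries in bands:
        for _, u, v in sorted(entries, reverse=True)[:k_per_band]:
            kept[u][v] = coeffs[u][v]
    return kept


if __name__ == "__main__":
    delta_w = [[1.0, 1.1, 0.9, 1.0] for _ in range(4)]  # toy weight change
    sparse = select_per_band(dct_2d(delta_w), n_bands=2, k_per_band=2)
    print(idct_2d(sparse))
```

Only the retained coefficients would need to be stored or trained, which is where the parameter and memory savings described in the abstract would come from in a scheme of this kind.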


Noise-Robustness Through Noise: Asymmetric LoRA Adaption with Poisoning Expert

Abstract

arXiv:2505.23868v1 Announce Type: cross Abstract: Current parameter-efficient fine-tuning methods for adapting pre-trained language models to downstream tasks are susceptible to interference from noisy data. Conventional noise-handling approaches either rely on laborious data pre-processing or employ model architecture modifications prone to error accumulation. In contrast to existing noise-processing paradigms, we propose a noise-robust adaptation method via asymmetric LoRA poisoning experts (LoPE), a novel framework that enhances model robustness to noise using only generated noisy data. Drawing inspiration from the mixture-of-experts architecture, LoPE strategically integrates a dedicated poisoning expert in an asymmetric LoRA configuration. Through a two-stage paradigm, LoPE performs noise injection on the poisoning expert during fine-tuning to enhance its noise discrimination and processing ability. During inference, we selectively mask the dedicated poisoning expert to leverage purified knowledge acquired by normal experts for noise-robust output. Extensive experiments demonstrate that LoPE achieves strong performance and robustness purely through low-cost noise injection, which completely eliminates the requirement of data cleaning.

摘要

当前针对预训练语言模型下游任务适配的参数高效微调方法易受噪声数据干扰。传统噪声处理方法要么依赖繁琐的数据预处理,要么采用容易导致误差累积的模型架构修改。不同于现有噪声处理范式,我们提出一种基于非对称LoRA污染专家(LoPE)的噪声鲁棒适配方法,该框架仅需生成噪声数据即可增强模型抗噪能力。受混合专家架构启发,LoPE在非对称LoRA配置中策略性地集成专用污染专家,通过两阶段范式在微调时对污染专家进行噪声注入以增强其噪声判别与处理能力。推理阶段选择性屏蔽该污染专家,利用正常专家习得的净化知识生成抗噪输出。大量实验表明,LoPE仅通过低成本噪声注入即可实现卓越性能与鲁棒性,完全无需数据清洗步骤。


ASyMOB: Algebraic Symbolic Mathematical Operations Benchmark

Abstract

arXiv:2505.23851v1 Announce Type: cross Abstract: Large language models (LLMs) are rapidly approaching the level of proficiency in university-level symbolic mathematics required for applications in advanced science and technology. However, existing benchmarks fall short in assessing the core skills of LLMs in symbolic mathematics, such as integration, differential equations, and algebraic simplification. To address this gap, we introduce ASyMOB, a novel assessment framework focused exclusively on symbolic manipulation, featuring 17,092 unique math challenges, organized by similarity and complexity. ASyMOB enables analysis of LLM generalization capabilities by comparing performance in problems that differ by simple numerical or symbolic "perturbations". Evaluated LLMs exhibit substantial degradation in performance for all perturbation types (up to -70.3%), suggesting reliance on memorized patterns rather than deeper understanding of symbolic math, even among models achieving high baseline accuracy. Comparing LLM performance to computer algebra systems (CAS), we identify examples where CAS fail while LLMs succeed, as well as problems solved only by combining both approaches. Models capable of integrated code execution yielded higher accuracy compared to their performance without code, particularly stabilizing weaker models (up to +33.1% for certain perturbation types). Notably, the most advanced models (o4-mini, Gemini 2.5 Flash) demonstrate not only high symbolic math proficiency (scoring 96.8% and 97.6% on the unperturbed set), but also remarkable robustness against perturbations (-21.7% and -21.2% vs. average -50.4% for the other models). This may indicate a recent "phase transition" in the generalization capabilities of frontier LLMs. It remains to be seen whether the path forward lies in deeper integration with sophisticated external tools, or in developing models so capable that symbolic math systems like CAS become unnecessary.

摘要

大型语言模型(LLMs)在高等科学与技术应用所需的大学水平符号数学能力方面正迅速接近熟练程度。然而,现有基准测试在评估LLMs符号数学核心能力(如积分、微分方程和代数简化)方面存在不足。为填补这一空白,我们提出ASyMOB——一个专注于符号操作的新型评估框架,包含17,092个独特数学挑战,按相似性和复杂性组织。ASyMOB通过比较模型在简单数值或符号"扰动"问题中的表现,实现对LLM泛化能力的分析。评估显示,所有LLM在各类扰动下性能均显著下降(最高达-70.3%),表明其依赖记忆模式而非对符号数学的深层理解,即使在基线准确率较高的模型中亦然。通过对比LLM与计算机代数系统的表现,我们发现了LLM成功而传统系统失败的案例,以及需要两者结合才能解决的问题。支持代码执行的模型相比纯文本推理展现出更高准确率(特定扰动类型最高提升+33.1%),尤其能稳定较弱模型的表现。值得注意的是,最先进模型(o4-mini、Gemini 2.5 Flash)不仅展现出卓越的符号数学能力(在未扰动集上分别获得96.8%和97.6%的分数),还对扰动表现出惊人鲁棒性(-21.7%和-21.2% vs 其他模型平均-50.4%)。这可能预示着前沿LLMs泛化能力近期发生了"相变"。未来发展方向究竟是与复杂外部工具的深度整合,还是开发出足以使计算机代数系统(CAS)变得多余的超强模型,仍有待观察。


Infi-Med: Low-Resource Medical MLLMs with Robust Reasoning Evaluation

Abstract

arXiv:2505.23867v1 Announce Type: cross Abstract: Multimodal large language models (MLLMs) have demonstrated promising prospects in healthcare, particularly for addressing complex medical tasks, supporting multidisciplinary treatment (MDT), and enabling personalized precision medicine. However, their practical deployment faces critical challenges in resource efficiency, diagnostic accuracy, clinical considerations, and ethical privacy. To address these limitations, we propose Infi-Med, a comprehensive framework for medical MLLMs that introduces three key innovations: (1) a resource-efficient approach through curating and constructing high-quality supervised fine-tuning (SFT) datasets with minimal sample requirements, with a forward-looking design that extends to both pretraining and posttraining phases; (2) enhanced multimodal reasoning capabilities for cross-modal integration and clinical task understanding; and (3) a systematic evaluation system that assesses model performance across medical modalities and task types. Our experiments demonstrate that Infi-Med achieves state-of-the-art (SOTA) performance in general medical reasoning while maintaining rapid adaptability to clinical scenarios. The framework establishes a solid foundation for deploying MLLMs in real-world healthcare settings by balancing model effectiveness with operational constraints.

摘要

多模态大语言模型(MLLMs)在医疗健康领域展现出广阔前景,尤其在处理复杂医学任务、支持多学科诊疗(MDT)和实现个性化精准医疗方面具有潜力。然而,其实际应用仍面临资源效率、诊断准确性、临床考量和伦理隐私等关键挑战。为突破这些局限,我们提出Infi-Med医疗MLLMs综合框架,包含三项核心创新:(1)通过精选和构建高质量监督微调(SFT)数据集实现资源高效利用,该前瞻性设计可扩展至预训练与训练后阶段;(2)增强跨模态整合与临床任务理解的多模态推理能力;(3)建立系统性评估体系,全面衡量模型在医学模态与任务类型上的表现。实验表明,Infi-Med在通用医学推理任务中达到最先进(SOTA)水平,同时保持对临床场景的快速适应能力。该框架通过平衡模型效能与操作约束,为MLLMs在真实医疗环境中的部署奠定了坚实基础。


Reinforcement Learning for Better Verbalized Confidence in Long-Form Generation

Abstract

arXiv:2505.23912v1 Announce Type: cross Abstract: Hallucination remains a major challenge for the safe and trustworthy deployment of large language models (LLMs) in factual content generation. Prior work has explored confidence estimation as an effective approach to hallucination detection, but often relies on post-hoc self-consistency methods that require computationally expensive sampling. Verbalized confidence offers a more efficient alternative, but existing approaches are largely limited to short-form question answering (QA) tasks and do not generalize well to open-ended generation. In this paper, we propose LoVeC (Long-form Verbalized Confidence), an on-the-fly verbalized confidence estimation method for long-form generation. Specifically, we use reinforcement learning (RL) to train LLMs to append numerical confidence scores to each generated statement, serving as a direct and interpretable signal of the factuality of generation. Our experiments consider both on-policy and off-policy RL methods, including DPO, ORPO, and GRPO, to enhance the model calibration. We introduce two novel evaluation settings, free-form tagging and iterative tagging, to assess different verbalized confidence estimation methods. Experiments on three long-form QA datasets show that our RL-trained models achieve better calibration and generalize robustly across domains. Also, our method is highly efficient, as it only requires adding a few tokens to the output being decoded.

摘要

幻觉问题仍然是大型语言模型(LLMs)在事实性内容生成中安全可信部署的主要挑战。先前研究探索了置信度估计作为幻觉检测的有效方法,但通常依赖于需要昂贵计算采样的后验自一致性方法。语言化置信度提供了一种更高效的替代方案,但现有方法大多局限于短形式问答(QA)任务,难以推广至开放式生成。本文提出LoVeC(长形式语言化置信度),一种面向长文本生成的实时语言化置信度估计方法。具体而言,我们采用强化学习(RL)训练LLMs为每个生成语句附加数值置信度分数,作为生成事实性的直接可解释信号。实验涵盖了策略内与策略外RL方法(包括DPO、ORPO和GRPO)以增强模型校准。我们引入两种新颖的评估设置——自由标注和迭代标注,用以评估不同语言化置信度估计方法。在三个长形式QA数据集上的实验表明,经RL训练的模型具有更好的校准性,并能跨领域稳健泛化。此外,该方法仅需在解码输出中添加少量标记,具有极高效率。


Probing Association Biases in LLM Moderation Over-Sensitivity

Abstract

arXiv:2505.23914v1 Announce Type: cross Abstract: Large Language Models are widely used for content moderation but often misclassify benign comments as toxic, leading to over-sensitivity. While previous research attributes this issue primarily to the presence of offensive terms, we reveal a potential cause beyond token level: LLMs exhibit systematic topic biases in their implicit associations. Inspired by cognitive psychology's implicit association tests, we introduce Topic Association Analysis, a semantic-level approach to quantify how LLMs associate certain topics with toxicity. By prompting LLMs to generate free-form scenario imagination for misclassified benign comments and analyzing their topic amplification levels, we find that more advanced models (e.g., GPT-4 Turbo) demonstrate stronger topic stereotype despite lower overall false positive rates. These biases suggest that LLMs do not merely react to explicit, offensive language but rely on learned topic associations, shaping their moderation decisions. Our findings highlight the need for refinement beyond keyword-based filtering, providing insights into the underlying mechanisms driving LLM over-sensitivity.

摘要

大语言模型被广泛用于内容审核,但常将良性评论误判为有害内容,表现出过度敏感性。尽管先前研究主要将此问题归因于冒犯性词汇的存在,但我们发现了一个超越词元层面的潜在原因:大语言模型在其隐性关联中表现出系统性的主题偏见。受认知心理学中内隐联想测试的启发,我们提出了主题关联分析法——一种语义层面的方法,用于量化大语言模型如何将特定主题与毒性相关联。通过提示大语言模型对误判的良性评论生成自由形式的情景想象,并分析其主题放大程度,我们发现更先进的模型(如GPT-4 Turbo)尽管总体误报率较低,却表现出更强的主题刻板印象。这些偏见表明,大语言模型不仅对显性的冒犯性语言作出反应,还依赖于习得的主题关联来形成审核决策。我们的研究结果强调,需要超越基于关键词的过滤方式进行改进,并为理解驱动大语言模型过度敏感性的内在机制提供了新见解。


Actor-Critic based Online Data Mixing For Language Model Pre-Training

Abstract

arXiv:2505.23878v1 Announce Type: cross Abstract: The coverage and composition of pretraining data significantly impact the generalization ability of Large Language Models (LLMs). To reduce the carbon footprint and financial costs of training, some data mixing methods, which apply the optimized domain weights of a small proxy model to train a larger one, have been proposed. However, these methods do not evolve with the training dynamics. The existing online data mixing (ODM) method addresses this limitation by applying the multi-armed bandit algorithm as the data sampling strategy. Yet, it does not consider the intra-domain interactions. In this paper, we develop an actor-critic based online data mixing (AC-ODM) method, which captures the varying domain weights with auxiliary actor-critic networks and considers the intra-domain interactions through the reward function. While constructing the dataset to pretrain a large target LLM, we directly apply the actor, which is trained with a small proxy LLM as the environment, as the sampling strategy. The transfer of the sampling strategy can not only ensure the efficiency of dynamic data mixing, but also expedite the convergence of pretraining the target LLM. Numerical results demonstrate that AC-ODM-410M, which invokes the sampling strategy obtained by a proxy LLM with 410M parameters, reaches the optimal validation perplexity of ODM 71% faster, improves accuracy on the zero-shot MMLU benchmark by 27.5%, and performs about 2.23x better on pass@1 of the HumanEval benchmark.

摘要

预训练数据的覆盖范围和组成显著影响大语言模型(LLM)的泛化能力。为降低训练过程的碳排放与经济成本,现有研究提出了若干数据混合方法——通过将小型代理模型优化的领域权重应用于更大模型的训练。然而这些方法未能随训练动态演化。现有在线数据混合(ODM)方法采用多臂老虎机算法作为数据采样策略以解决该局限,但未考虑域内交互作用。本文提出基于行动者-评论家框架的在线数据混合方法(AC-ODM),通过辅助行动者-评论家网络捕捉动态领域权重,并利用奖励函数考虑域内交互。在构建大型目标LLM预训练数据集时,我们直接将以小型代理LLM作为环境训练的行动者网络作为采样策略。该策略迁移不仅能保证动态数据混合的效率,还可加速目标LLM预训练的收敛。数值实验表明:调用410M参数代理LLM所得采样策略的AC-ODM-410M,其验证困惑度达到最优值速度比ODM快71%,在零样本MMLU基准测试中准确率提升27.5%,HumanEval基准测试pass@1指标提高约2.23倍。


A Closer Look at Bias and Chain-of-Thought Faithfulness of Large (Vision) Language Models

Abstract

arXiv:2505.23945v1 Announce Type: cross Abstract: Chain-of-thought (CoT) reasoning enhances performance of large language models, but questions remain about whether these reasoning traces faithfully reflect the internal processes of the model. We present the first comprehensive study of CoT faithfulness in large vision-language models (LVLMs), investigating how both text-based and previously unexplored image-based biases affect reasoning and bias articulation. Our work introduces a novel, fine-grained evaluation pipeline for categorizing bias articulation patterns, enabling significantly more precise analysis of CoT reasoning than previous methods. This framework reveals critical distinctions in how models process and respond to different types of biases, providing new insights into LVLM CoT faithfulness. Our findings reveal that subtle image-based biases are rarely articulated compared to explicit text-based ones, even in models specialized for reasoning. Additionally, many models exhibit a previously unidentified phenomenon we term "inconsistent" reasoning: reasoning correctly before abruptly changing answers, which serves as a potential canary for detecting biased reasoning from unfaithful CoTs. We then apply the same evaluation pipeline to revisit CoT faithfulness in LLMs across various levels of implicit cues. Our findings reveal that current language-only reasoning models continue to struggle with articulating cues that are not overtly stated.

摘要

思维链(CoT)推理提升了大型语言模型的性能,但这些推理轨迹是否真实反映模型的内部处理过程仍存疑问。我们首次对大型视觉语言模型(LVLM)中的CoT忠实性进行全面研究,探讨了基于文本的偏见和先前未被探索的基于图像的偏见如何影响推理与偏误表达。本研究提出了一种新颖的细粒度评估流程,用于对偏误表达模式进行分类,相比现有方法能实现更精确的CoT推理分析。该框架揭示了模型在处理和响应不同类型偏见时的关键差异,为LVLM的CoT忠实性提供了新见解。研究发现:与显性文本偏见相比,即使是专为推理优化的模型也极少表达细微的图像偏见;此外,许多模型表现出我们称之为"不一致推理"的新现象——在正确推理后突然改变答案,这可作为检测非忠实CoT偏误推理的潜在预警信号。我们随后将该评估流程应用于重新检验语言模型(LLM)在不同隐含线索层级下的CoT忠实性,发现当前纯语言推理模型仍难以准确表达未明确陈述的隐含线索。


Information Structure in Mappings: An Approach to Learning, Representation, and Generalisation

Abstract

arXiv:2505.23960v1 Announce Type: cross Abstract: Despite the remarkable success of large-scale neural networks, we still lack unified notation for thinking about and describing their representational spaces. We lack methods to reliably describe how their representations are structured, how that structure emerges over training, and what kinds of structures are desirable. This thesis introduces quantitative methods for identifying systematic structure in a mapping between spaces, and leverages them to understand how deep-learning models learn to represent information, what representational structures drive generalisation, and how design decisions condition the structures that emerge. To do this I identify structural primitives present in a mapping, along with information theoretic quantifications of each. These allow us to analyse learning, structure, and generalisation across multi-agent reinforcement learning models, sequence-to-sequence models trained on a single task, and Large Language Models. I also introduce a novel, performant approach to estimating the entropy of a vector space, which allows this analysis to be applied to models ranging in size from 1 million to 12 billion parameters. The experiments here work to shed light on how large-scale distributed models of cognition learn, while allowing us to draw parallels between those systems and their human analogs. They show how the structures of language and the constraints that give rise to them in many ways parallel the kinds of structures that drive performance of contemporary neural networks.

摘要

尽管大规模神经网络取得了显著成功,我们仍缺乏统一的符号体系来描述和理解其表征空间。当前我们难以可靠地描述这些表征的结构特征、该结构在训练过程中如何形成,以及何种结构具有优越性。本论文提出了一套量化方法来识别空间映射中的系统性结构,并运用这些方法揭示深度学习模型如何学习信息表征、哪些表征结构驱动泛化能力,以及设计决策如何影响最终形成的结构。为此,我首先识别映射中存在的基础结构要素,并给出每个要素的信息论量化指标。这些工具使我们能够分析多智能体强化学习模型、单任务训练的序列到序列模型以及大语言模型中的学习过程、结构特征与泛化表现。此外,我提出了一种新颖高效的方法来估计向量空间的熵,使得该分析可应用于参数量从100万到120亿不等的各类模型。本研究旨在阐明大规模分布式认知模型的学习机制,同时揭示这些系统与人类认知系统的相似性。实验结果表明,语言的结构特征及其形成约束条件,与当代神经网络性能驱动的结构特征存在多方面的对应关系。


Large Language Models for Controllable Multi-property Multi-objective Molecule Optimization

Abstract

arXiv:2505.23987v1 Announce Type: cross Abstract: In real-world drug design, molecule optimization requires selectively improving multiple molecular properties up to pharmaceutically relevant levels, while maintaining others that already meet such criteria. However, existing computational approaches and instruction-tuned LLMs fail to capture such nuanced property-specific objectives, limiting their practical applicability. To address this, we introduce C-MuMOInstruct, the first instruction-tuning dataset focused on multi-property optimization with explicit, property-specific objectives. Leveraging C-MuMOInstruct, we develop GeLLMO-Cs, a series of instruction-tuned LLMs that can perform targeted property-specific optimization. Our experiments across 5 in-distribution and 5 out-of-distribution tasks show that GeLLMO-Cs consistently outperform strong baselines, achieving up to 126% higher success rate. Notably, GeLLMO-Cs exhibit impressive 0-shot generalization to novel optimization tasks and unseen instructions. This offers a step toward a foundational LLM to support realistic, diverse optimizations with property-specific objectives. C-MuMOInstruct and code are accessible through https://github.com/ninglab/GeLLMO-C.

摘要

在实际药物设计中,分子优化需要选择性地将多个分子特性提升至制药相关水平,同时保持其他已达标的特性不变。然而现有计算方法和指令调优大语言模型均无法捕捉这种具有属性特异性的细微优化目标,限制了其实际应用价值。为此,我们提出了首个专注于多属性优化且具有明确属性特异性目标的指令调优数据集C-MuMOInstruct。基于该数据集,我们开发了GeLLMO-Cs系列指令调优大语言模型,可实现针对特定属性的定向优化。在5个分布内和5个分布外任务上的实验表明,GeLLMO-Cs始终优于强基线模型,成功率最高提升达126%。值得注意的是,GeLLMO-Cs对新型优化任务和未见指令展现出优异的零样本泛化能力。这为构建支持具有属性特异性目标的现实多样化优化的基础大语言模型迈出了重要一步。C-MuMOInstruct数据集及相关代码可通过https://github.com/ninglab/GeLLMO-C获取。


TSENOR: Highly-Efficient Algorithm for Finding Transposable N:M Sparse Masks

Abstract

arXiv:2505.23949v1 Announce Type: cross Abstract: Network pruning reduces the computational requirements of large neural networks, with N:M sparsity -- retaining only N out of every M consecutive weights -- offering a compelling balance between compressed model quality and hardware acceleration. However, N:M sparsity only accelerates forward-pass computations, as N:M patterns are not preserved during matrix transposition, limiting efficiency during training where both passes are computationally intensive. While transposable N:M sparsity has been proposed to address this limitation, existing methods for finding transposable N:M sparse masks either fail to scale to large models or are restricted to M=4, which results in a suboptimal compression-accuracy trade-off. We introduce an efficient solver for transposable N:M masks that scales to billion-parameter models. We formulate mask generation as optimal transport problems and solve them through entropy regularization and Dykstra's algorithm, followed by a rounding procedure. Our tensor-based implementation exploits GPU parallelism, achieving up to 100x speedup with only 1-10% error compared to existing methods. Our approach can be integrated with layer-wise N:M pruning frameworks including Wanda, SparseGPT and ALPS to produce transposable N:M sparse models with arbitrary N:M values. Experiments show that LLaMA3.2-8B with transposable 16:32 sparsity maintains performance close to its standard N:M counterpart and outperforms the standard 2:4 sparse model, showing the practical value of our approach.

摘要

网络剪枝能够降低大型神经网络的计算需求,其中N:M稀疏性(即每M个连续权重中仅保留N个)在压缩模型质量与硬件加速之间提供了理想的平衡。然而,N:M稀疏性仅能加速前向计算过程,因为矩阵转置时N:M模式无法保持,限制了训练阶段(正反向计算均密集)的效率。虽然可转置N:M稀疏性已被提出以解决此限制,但现有方法在生成可转置N:M稀疏掩码时,要么难以扩展至大型模型,要么仅限于M=4的次优压缩-精度权衡方案。我们提出一种高效的可转置N:M掩码求解器,可扩展至十亿参数模型。通过将掩码生成建模为最优传输问题,采用熵正则化与Dykstra算法求解,并辅以舍入流程实现。基于张量的GPU并行实现相较现有方法提速达100倍,误差仅1-10%。该方法可与Wanda、SparseGPT及ALPS等分层N:M剪枝框架结合,生成任意N:M值的可转置稀疏模型。实验表明,采用可转置16:32稀疏化的LLaMA3.2-8B模型性能接近标准N:M版本,且优于标准2:4稀疏模型,验证了本方法的实用价值。
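To make the N:M pattern and the transposition problem concrete, here is a minimal sketch in plain Python. It is not the paper's optimal-transport solver: it only builds a simple magnitude-based N:M mask along rows and checks whether the same mask still satisfies N:M after transposition. All function names (`nm_mask_rows`, `satisfies_nm`, `is_transposable_nm`) are hypothetical and chosen for illustration.

```python
def nm_mask_rows(weights, n, m):
    """Magnitude-based N:M mask: in every group of m consecutive weights
    along a row, keep (mark with 1) the n entries of largest magnitude."""
    mask = []
    for row in weights:
        assert len(row) % m == 0, "row length must be divisible by m"
        mask_row = [0] * len(row)
        for start in range(0, len(row), m):
            group = range(start, start + m)
            keep = sorted(group, key=lambda j: abs(row[j]), reverse=True)[:n]
            for j in keep:
                mask_row[j] = 1
        mask.append(mask_row)
    return mask


def satisfies_nm(mask, n, m):
    """True if every group of m consecutive entries in each row keeps exactly n."""
    return all(
        sum(row[s:s + m]) == n
        for row in mask
        for s in range(0, len(row), m)
    )


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def is_transposable_nm(mask, n, m):
    """A mask is transposable N:M if both it and its transpose satisfy N:M."""
    return satisfies_nm(mask, n, m) and satisfies_nm(transpose(mask), n, m)


# Block-structured weights: the row-wise 2:4 mask happens to be transposable.
block = [
    [0.9, 0.8, 0.1, 0.2],
    [0.7, 0.6, 0.1, 0.1],
    [0.1, 0.2, 0.9, 0.8],
    [0.1, 0.1, 0.7, 0.6],
]
# Identical rows: every row keeps the same two columns, so the transpose
# has groups with 4 or 0 kept entries and violates 2:4.
repeated = [[0.9, 0.8, 0.1, 0.2] for _ in range(4)]

print(is_transposable_nm(nm_mask_rows(block, 2, 4), 2, 4))
print(is_transposable_nm(nm_mask_rows(repeated, 2, 4), 2, 4))
```

The second case shows why a plain row-wise mask is generally not enough for the backward pass, which multiplies by the transposed weight matrix. Finding a mask that satisfies both constraints while preserving as much weight magnitude as possible is the problem the abstract describes.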


Is Your Model Fairly Certain? Uncertainty-Aware Fairness Evaluation for LLMs

Abstract

arXiv:2505.23996v1 Announce Type: cross Abstract: The recent rapid adoption of large language models (LLMs) highlights the critical need for benchmarking their fairness. Conventional fairness metrics, which focus on discrete accuracy-based evaluations (i.e., prediction correctness), fail to capture the implicit impact of model uncertainty (e.g., higher model confidence about one group over another despite similar accuracy). To address this limitation, we propose an uncertainty-aware fairness metric, UCerF, to enable a fine-grained evaluation of model fairness that is more reflective of the internal bias in model decisions compared to conventional fairness measures. Furthermore, observing data size, diversity, and clarity issues in current datasets, we introduce a new gender-occupation fairness evaluation dataset with 31,756 samples for co-reference resolution, offering a more diverse and suitable dataset for evaluating modern LLMs. We establish a benchmark, using our metric and dataset, and apply it to evaluate the behavior of ten open-source LLMs. For example, Mistral-7B exhibits suboptimal fairness due to high confidence in incorrect predictions, a detail overlooked by Equalized Odds but captured by UCerF. Overall, our proposed LLM benchmark, which evaluates fairness with uncertainty awareness, paves the way for developing more transparent and accountable AI systems.

摘要

最近大规模语言模型(LLMs)的快速普及凸显了对其公平性进行基准测试的迫切需求。传统公平性指标聚焦于基于离散准确率的评估(即预测正确性),却未能捕捉模型不确定性的隐性影响(例如,尽管准确率相近,模型对某一群体的预测置信度显著高于另一群体)。为解决这一局限,我们提出一种不确定性感知的公平性指标UCerF,相较于传统公平性度量,该指标能通过细粒度评估更准确地反映模型决策中的内部偏差。此外,针对现有数据集中存在的样本规模、多样性和清晰度问题,我们构建了一个包含31,756个样本的性别-职业共指消解公平性评估数据集,为评估现代LLMs提供了更具多样性和适配性的测试资源。基于该指标和数据集,我们建立了基准测试框架,并应用于评估十个开源LLMs的表现。例如,Mistral-7B由于对错误预测表现出高置信度而存在公平性缺陷,这一现象被均等几率指标忽略,但被UCerF有效捕捉。总体而言,我们提出的具有不确定性感知能力的LLM公平性评估基准,为开发更透明、可问责的AI系统奠定了基础。


Large Language Model Meets Constraint Propagation

Abstract

arXiv:2505.24012v1 Announce Type: cross Abstract: Large Language Models (LLMs) excel at generating fluent text but struggle to enforce external constraints because they generate tokens sequentially without explicit control mechanisms. GenCP addresses this limitation by combining LLM predictions with Constraint Programming (CP) reasoning, formulating text generation as a Constraint Satisfaction Problem (CSP). In this paper, we improve GenCP by integrating Masked Language Models (MLMs) for domain generation, which allows bidirectional constraint propagation that leverages both past and future tokens. This integration bridges the gap between token-level prediction and structured constraint enforcement, leading to more reliable and constraint-aware text generation. Our evaluation on COLLIE benchmarks demonstrates that incorporating domain preview via MLM calls significantly improves GenCP's performance. Although this approach incurs additional MLM calls and, in some cases, increased backtracking, the overall effect is a more efficient use of LLM inferences and an enhanced ability to generate feasible and meaningful solutions, particularly in tasks with strict content constraints.

摘要

大语言模型(LLMs)擅长生成流畅文本,但由于其采用顺序生成标记且缺乏显式控制机制,难以强制执行外部约束。GenCP通过将LLM预测与约束规划(CP)推理相结合,将文本生成建模为约束满足问题(CSP),从而解决这一局限。本文通过集成掩码语言模型(MLMs)进行域生成来改进GenCP,该方法支持双向约束传播,可同时利用历史与未来标记。这种整合弥合了标记级预测与结构化约束执行之间的鸿沟,实现了更可靠且具备约束感知的文本生成。在COLLIE基准测试中,评估表明通过MLM调用实现的域预览显著提升了GenCP的性能。尽管该方法会引入额外MLM调用及部分情况下回溯次数增加,但整体效果是更高效地利用LLM推理能力,并增强生成可行且有意义解的能力,尤其在具有严格内容约束的任务中表现突出。
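The idea of treating generation as a constraint satisfaction problem can be sketched with a toy backtracking search. This is only an illustration under stated assumptions, not GenCP itself: the ranked candidate lists stand in for LLM predictions, the constraints are invented for the example, and the MLM-based bidirectional domain preview is not modelled. All names are hypothetical.

```python
def backtracking_generate(propose, length, partial_ok, complete_ok):
    """Depth-first search over word sequences.

    propose(prefix)    -> ranked candidate next words (stand-in for a model)
    partial_ok(words)  -> cheap check used to prune infeasible prefixes
    complete_ok(words) -> full constraint check on a finished sequence
    """
    def extend(prefix):
        if len(prefix) == length:
            return prefix if complete_ok(prefix) else None
        for word in propose(prefix):
            attempt = prefix + [word]
            if not partial_ok(attempt):
                continue  # prune: this prefix can never satisfy the constraint
            result = extend(attempt)
            if result is not None:
                return result
        return None  # every candidate failed: backtrack to the caller

    return extend([])


# Toy "model": a fixed ranking of candidates per position (assumption).
RANKED = [["the", "a"], ["dog", "cat"], ["runs", "sleeps"]]


def propose(prefix):
    return RANKED[len(prefix)]


def letters(words):
    return sum(len(w) for w in words)


def partial_ok(words):
    return letters(words) <= 10          # cannot exceed the letter budget


def complete_ok(words):
    return "cat" in words and letters(words) == 10


sentence = backtracking_generate(propose, 3, partial_ok, complete_ok)
print(" ".join(sentence))  # prints "the cat runs"
```

The search first follows the model's top-ranked words ("the dog runs"), finds that the finished sentence violates the content constraint, and backtracks. Per the abstract, previewing domains with an MLM is meant to let propagation use both past and future tokens, at the cost of extra MLM calls and sometimes more backtracking.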


Enhancing LLM-Based Code Generation with Complexity Metrics: A Feedback-Driven Approach

Abstract

arXiv:2505.23953v1 Announce Type: cross Abstract: Automatic code generation has gained significant momentum with the advent of Large Language Models (LLMs) such as GPT-4. Although many studies focus on improving the effectiveness of LLMs for code generation, very limited work tries to understand the generated code's characteristics and leverage that to improve failed cases. In this paper, we investigate code complexity, the most straightforward characteristic of code, and its relationship to the success of LLM-generated code. Using a large set of standard complexity metrics, we first conduct an empirical analysis to explore their correlation with LLMs' performance on code generation (i.e., Pass@1). Using logistic regression models, we identify which complexity metrics are most predictive of code correctness. Building on these findings, we propose an iterative feedback method, where LLMs are prompted to generate correct code based on complexity metrics from previous failed outputs. We validate our approach across multiple benchmarks (i.e., HumanEval, MBPP, LeetCode, and BigCodeBench) and various LLMs (i.e., GPT-4o, GPT-3.5 Turbo, Llama 3.1, and GPT-o3 mini), comparing the results with two baseline methods: (a) zero-shot generation, and (b) iterative execution-based feedback without our code complexity insights. Experiment results show that our approach makes notable improvements, particularly with a smaller LLM (GPT-3.5 Turbo), where, e.g., Pass@1 increased by 35.71% compared to the baseline's improvement of 12.5% on the HumanEval dataset. The study expands experiments to BigCodeBench and integrates the method with the Reflexion code generation agent, leading to Pass@1 improvements of 20% (GPT-4o) and 23.07% (GPT-o3 mini). The results highlight that complexity-aware feedback enhances both direct LLM prompting and agent-based workflows.

摘要

随着GPT-4等大语言模型(LLM)的出现,自动代码生成技术获得了显著发展。尽管大量研究致力于提升LLM代码生成的有效性,但极少有工作尝试理解生成代码的特征并利用这些特征改进失败案例。本文以代码最直观的特征为切入点,探究代码复杂度与LLM生成代码成功率之间的关系。通过采用大量标准复杂度指标,我们首先进行实证分析以探索其与LLM代码生成性能(即Pass@1)的关联性。借助逻辑回归模型,我们识别出哪些复杂度指标最能预测代码正确性。基于这些发现,我们提出一种迭代反馈方法,通过提示LLM根据先前失败输出的复杂度指标来生成正确代码。我们在多个基准测试集(包括HumanEval、MBPP、LeetCode和BigCodeBench)和不同LLM模型(GPT-4o、GPT-3.5 Turbo、Llama 3.1和GPT-o3 mini)上验证该方法,并与两种基线方法进行比较:(a)零样本生成,(b)不包含代码复杂度分析的基于执行的迭代反馈。实验结果表明,我们的方法取得了显著改进,特别是在较小模型(GPT3.5 Turbo)上,例如在HumanEval数据集上Pass@1提升了35.71%,而基线方法仅提升12.5%。研究进一步将实验扩展到BigCodeBench,并将该方法与Reflexion代码生成代理集成,使Pass@1分别提升20%(GPT-4o)和23.07%(GPT-o3 mini)。这些结果证明,复杂度感知反馈能同时提升直接LLM提示和基于代理的工作流程的性能。
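As a rough illustration of complexity-aware feedback, the sketch below computes an approximate cyclomatic complexity for a Python snippet with the standard `ast` module and turns it into a feedback message. The abstract does not list the exact metrics or prompt wording used in the paper, so the metric approximation, the `build_feedback` helper, and its message text are all assumptions made for this example.

```python
import ast


def cyclomatic_complexity(source):
    """Approximate cyclomatic complexity: 1 plus the number of decision points.

    Counts if/elif, for, while, conditional expressions, except handlers,
    and each extra operand of a boolean `and`/`or` chain.
    """
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.If, ast.For, ast.While, ast.IfExp, ast.ExceptHandler)):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            complexity += len(node.values) - 1
    return complexity


def build_feedback(source):
    """Hypothetical feedback text that could be appended to a retry prompt."""
    score = cyclomatic_complexity(source)
    non_blank = sum(1 for line in source.splitlines() if line.strip())
    return (
        "The previous attempt failed its tests. "
        f"It has an approximate cyclomatic complexity of {score} "
        f"across {non_blank} non-blank lines. "
        "Please produce a corrected solution and reconsider its branching structure."
    )


FAILED_ATTEMPT = """
def sign(x):
    if x > 0 and x != 0:
        return 1
    elif x < 0:
        return -1
    return 0
"""

print(cyclomatic_complexity(FAILED_ATTEMPT))  # prints 4
print(build_feedback(FAILED_ATTEMPT))
```

In an iterative loop, a message like this would be sent back to the model alongside the original task after each failed attempt. Which metrics are worth reporting is what the abstract's logistic regression analysis is meant to determine.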


LLM Agents Should Employ Security Principles

Abstract

arXiv:2505.24019v1 Announce Type: cross Abstract: Large Language Model (LLM) agents show considerable promise for automating complex tasks using contextual reasoning; however, interactions involving multiple agents and the system's susceptibility to prompt injection and other forms of context manipulation introduce new vulnerabilities related to privacy leakage and system exploitation. This position paper argues that the well-established design principles in information security, which are commonly referred to as security principles, should be employed when deploying LLM agents at scale. Design principles such as defense-in-depth, least privilege, complete mediation, and psychological acceptability have helped guide the design of mechanisms for securing information systems over the last five decades, and we argue that their explicit and conscientious adoption will help secure agentic systems. To illustrate this approach, we introduce AgentSandbox, a conceptual framework embedding these security principles to provide safeguards throughout an agent's life-cycle. We evaluate with state-of-the-art LLMs along three dimensions: benign utility, attack utility, and attack success rate. AgentSandbox maintains high utility for its intended functions under both benign and adversarial evaluations while substantially mitigating privacy risks. By embedding secure design principles as foundational elements within emerging LLM agent protocols, we aim to promote trustworthy agent ecosystems aligned with user privacy expectations and evolving regulatory requirements.

摘要

大型语言模型(LLM)智能体在利用上下文推理实现复杂任务自动化方面展现出显著潜力,然而多智能体交互及系统对提示注入等上下文操纵手段的敏感性,引发了隐私泄露与系统滥用的新型安全风险。本立场文件提出,在规模化部署LLM智能体时应当采用信息安全领域成熟的设计原则(即安全原则)。纵深防御、最小权限、完全仲裁和心理可接受性等设计原则在过去五十年间始终指导着信息系统的安全机制设计,我们认为其明确且审慎的应用将有效保障智能体系统的安全性。为验证这一理念,我们提出AgentSandbox概念框架,通过嵌入上述安全原则为智能体全生命周期提供防护机制。基于最先进LLM模型,我们从良性功能效用、攻击效用和攻击成功率三个维度进行评估。实验表明AgentSandbox在保持预期功能高效性的同时,既能通过良性评估又能显著降低隐私风险。通过将安全设计原则作为基础要素嵌入新兴LLM智能体协议,我们致力于构建符合用户隐私预期与监管要求的可信智能体生态系统。


Diversity of Transformer Layers: One Aspect of Parameter Scaling Laws

Abstract

arXiv:2505.24009v1 Announce Type: cross Abstract: Transformers deliver outstanding performance across a wide range of tasks and are now a dominant backbone architecture for large language models (LLMs). Their task-solving performance is improved by increasing parameter size, as shown in the recent studies on parameter scaling laws. Although recent mechanistic-interpretability studies have deepened our understanding of the internal behavior of Transformers by analyzing their residual stream, the relationship between these internal mechanisms and the parameter scaling laws remains unclear. To bridge this gap, we focus on layers and their size, which mainly decide the parameter size of Transformers. For this purpose, we first theoretically investigate the layers within the residual stream through a bias-diversity decomposition. The decomposition separates (i) bias, the error of each layer's output from the ground truth, and (ii) diversity, which indicates how much the outputs of each layer differ from each other. Analyzing Transformers under this theory reveals that performance improves when individual layers make predictions close to the correct answer and remain mutually diverse. We show that diversity becomes especially critical when individual layers' outputs are far from the ground truth. Finally, we introduce an information-theoretic diversity and show our main finding: adding layers enhances performance only when those layers behave differently, i.e., are diverse. We also reveal that the performance gains from increasing the number of layers exhibit submodularity: marginal improvements diminish as more layers are added, mirroring the logarithmic convergence predicted by the parameter scaling laws. Experiments on multiple semantic-understanding tasks with various LLMs empirically confirm the theoretical properties derived in this study.

摘要

Transformer模型在广泛的任务中展现出卓越性能,现已成为大型语言模型(LLMs)的主导骨干架构。近期参数缩放定律研究表明,通过增加参数规模可提升其任务解决能力。尽管机制可解释性研究通过分析残差流深化了我们对Transformer内部行为的理解,但这些内部机制与参数缩放定律间的关联仍不明确。为弥合这一鸿沟,我们聚焦于主要决定Transformer参数规模的层及其尺寸。为此,我们首先通过偏置-多样性分解理论研究了残差流中的各层级。该分解将(i)偏置(各层输出与真实值的误差)与(ii)多样性(各层输出间的差异程度)分离。基于此理论分析表明:当各层预测接近正确答案且保持相互差异时,模型性能提升。研究发现当单层输出远离真实值时,多样性尤为关键。最后,我们引入信息论多样性指标,并揭示核心发现:仅当新增层表现异质化(即具备多样性)时,增加层数才能提升性能。同时发现层数增加带来的性能增益具有次模性:边际改进随层数增加而递减,这与参数缩放定律预测的对数收敛特性相符。通过多种LLMs在多项语义理解任务上的实验,本研究推导的理论特性得到了实证验证。
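The interplay of bias and diversity can be checked numerically with the classical squared-error ambiguity decomposition for an averaged ensemble: the error of the mean prediction equals the average individual error minus the average spread around the mean. The abstract does not give the paper's exact formulation for residual-stream layers, so treating each layer as an ensemble member with a scalar output is an assumption made purely for illustration.

```python
def ambiguity_decomposition(predictions, target):
    """Return (ensemble_error, average_bias, diversity) for squared error.

    ensemble_error = (mean prediction - target)^2
    average_bias   = mean of (prediction_i - target)^2
    diversity      = mean of (prediction_i - mean prediction)^2
    Identity: ensemble_error == average_bias - diversity
    """
    k = len(predictions)
    mean_pred = sum(predictions) / k
    ensemble_error = (mean_pred - target) ** 2
    average_bias = sum((p - target) ** 2 for p in predictions) / k
    diversity = sum((p - mean_pred) ** 2 for p in predictions) / k
    return ensemble_error, average_bias, diversity


# Hypothetical scalar outputs of three "layers" and a ground-truth value.
layer_outputs = [1.0, 3.0, 5.0]
ground_truth = 2.0

error, bias, diversity = ambiguity_decomposition(layer_outputs, ground_truth)
print(error, bias, diversity)
print(abs(error - (bias - diversity)) < 1e-12)  # the identity holds
```

For a fixed average bias, larger diversity lowers the error of the combined prediction. This matches the abstract's claim that added layers help only when they behave differently from the existing ones.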


LlamaRL: A Distributed Asynchronous Reinforcement Learning Framework for Efficient Large-scale LLM Training

Abstract

arXiv:2505.24034v1 Announce Type: cross Abstract: Reinforcement Learning (RL) has become the most effective post-training approach for improving the capabilities of Large Language Models (LLMs). In practice, because of the high demands on latency and memory, it is particularly challenging to develop an efficient RL framework that reliably manages policy models with hundreds to thousands of billions of parameters. In this paper, we present LlamaRL, a fully distributed, asynchronous RL framework optimized for efficient training of large-scale LLMs with various model sizes (8B, 70B, and 405B parameters) on GPU clusters ranging from a handful to thousands of devices. LlamaRL introduces a streamlined, single-controller architecture built entirely on native PyTorch, enabling modularity, ease of use, and seamless scalability to thousands of GPUs. We also provide a theoretical analysis of LlamaRL's efficiency, including a formal proof that its asynchronous design leads to strict RL speed-up. Empirically, by leveraging best practices such as colocated model offloading, asynchronous off-policy training, and distributed direct memory access for weight synchronization, LlamaRL achieves significant efficiency gains -- up to 10.7x speed-up compared to DeepSpeed-Chat-like systems on a 405B-parameter policy model. Furthermore, the efficiency advantage continues to grow with increasing model scale, demonstrating the framework's suitability for future large-scale RL training.

摘要

强化学习(RL)已成为提升大语言模型(LLM)能力最有效的后训练方法。在实际应用中,由于对延迟和内存的高要求,开发一个能可靠管理数千亿至数万亿参数策略模型的高效RL框架尤为困难。本文提出LlamaRL——一个完全分布式、异步的RL框架,专为在GPU集群(从少量到数千台设备)上高效训练不同规模(8B、70B和405B参数)的大规模LLM而优化。LlamaRL采用基于原生PyTorch构建的简洁单控制器架构,具备模块化、易用性及无缝扩展至数千GPU的能力。我们还对LlamaRL的效率进行了理论分析,包括形式化证明其异步设计能带来严格的RL加速。实证表明,通过采用共置模型卸载、异步离策略训练及分布式直接内存访问进行权重同步等最佳实践,LlamaRL实现了显著的效率提升——在405B参数策略模型上相比类DeepSpeed-Chat系统最高可获得10.7倍加速。此外,该效率优势随模型规模扩大持续增长,证明了该框架对未来大规模RL训练的适用性。


MedPAIR: Measuring Physicians and AI Relevance Alignment in Medical Question Answering

Abstract

arXiv:2505.24040v1 Announce Type: cross Abstract: Large Language Models (LLMs) have demonstrated remarkable performance on various medical question-answering (QA) benchmarks, including standardized medical exams. However, correct answers alone do not ensure correct logic, and models may reach accurate conclusions through flawed processes. In this study, we introduce the MedPAIR (Medical Dataset Comparing Physicians and AI Relevance Estimation and Question Answering) dataset to evaluate how physician trainees and LLMs prioritize relevant information when answering QA questions. We obtain annotations on 1,300 QA pairs from 36 physician trainees, labeling each sentence within the question components for relevance. We compare these relevance estimates to those for LLMs, and further evaluate the impact of these "relevant" subsets on downstream task performance for both physician trainees and LLMs. We find that LLMs are frequently not aligned with the content relevance estimates of physician trainees. After filtering out physician trainee-labeled irrelevant sentences, accuracy improves for both the trainees and the LLMs. All LLM and physician trainee-labeled data are available at: http://medpair.csail.mit.edu/.

摘要

大型语言模型(LLMs)在各类医学问答基准测试(包括标准化医学考试)中展现出卓越性能。然而,正确答案本身并不能确保逻辑的正确性,模型可能通过有缺陷的推理过程得出准确结论。本研究引入MedPAIR(医学数据集:医师与人工智能相关性评估及问答比较)数据集,用于评估医师培训生与LLMs在回答问答题目时如何优先处理相关信息。我们获取了36名医师培训生对1,300个问答对的标注数据,对问题组件中的每个句子进行相关性标记。将这些相关性评估与LLMs的结果进行对比,并进一步评估这些"相关"子集对医师培训生和LLMs下游任务表现的影响。研究发现,LLMs与医师培训生的内容相关性评估经常不一致。在过滤掉医师培训生标记为无关的句子后,医师培训生和LLMs的准确率均有所提升。所有LLM及医师培训生标注数据详见:http://medpair.csail.mit.edu/。


R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration

Abstract

arXiv:2505.24133v1 Announce Type: cross Abstract: Reasoning models have demonstrated impressive performance in self-reflection and chain-of-thought reasoning. However, they often produce excessively long outputs, leading to prohibitively large key-value (KV) caches during inference. While chain-of-thought inference significantly improves performance on complex reasoning tasks, it can also lead to reasoning failures when deployed with existing KV cache compression approaches. To address this, we propose Redundancy-aware KV Cache Compression for Reasoning models (R-KV), a novel method specifically targeting redundant tokens in reasoning models. Our method preserves nearly 100% of the full KV cache performance using only 10% of the KV cache, substantially outperforming existing KV cache baselines, which reach only 60% of the performance. Remarkably, R-KV even achieves 105% of full KV cache performance with 16% of the KV cache. This KV-cache reduction also leads to a 90% memory saving and 6.6x the throughput of standard chain-of-thought reasoning inference. Experimental results show that R-KV consistently outperforms existing KV cache compression baselines across two mathematical reasoning datasets.

摘要

推理模型在自我反思和思维链推理中展现出卓越性能,但其输出往往过长,导致推理过程中键值(KV)缓存量激增。虽然思维链推理能显著提升复杂推理任务的性能,但当采用现有KV缓存压缩方法时,也可能引发推理失败。为此,我们提出面向推理模型的冗余感知KV缓存压缩方法(R-KV),该方法专门针对推理模型中的冗余令牌进行优化。我们的方法仅需10%的KV缓存即可保持近100%的全缓存性能,显著优于现有基线方法(后者仅能达到60%性能)。值得注意的是,当使用16%的KV缓存时,R-KV甚至能达到105%的全缓存性能。这种KV缓存的缩减还可实现90%的内存节省,并使推理吞吐量达到标准思维链推理的6.6倍。实验结果表明,在两个数学推理数据集上,R-KV始终优于现有KV缓存压缩基线方法。
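To see why long reasoning traces make the KV cache a bottleneck, a back-of-the-envelope size calculation helps. The sketch below uses the standard accounting of one key and one value vector per token, per layer, per KV head. The model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values) and the 32,768-token trace are illustrative assumptions, not figures from the paper, and the 10% budget is applied simply as a token-retention ratio rather than through R-KV's actual selection procedure.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Size of a dense KV cache: keys and values (factor 2) for every
    layer, KV head, and token position."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size


# Hypothetical model shape and trace length (assumptions for illustration).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
TRACE_TOKENS = 32768

full = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, TRACE_TOKENS)

# Keep 10% of the tokens, mirroring the cache budget quoted in the abstract.
retained_tokens = TRACE_TOKENS * 10 // 100
compressed = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, retained_tokens)

print(f"full cache:       {full / 1024**3:.2f} GiB")
print(f"compressed cache: {compressed / 1024**3:.2f} GiB")
print(f"memory saving:    {1 - compressed / full:.1%}")
```

Under these assumptions a single 32K-token trace occupies 4 GiB of cache, and retaining a tenth of the tokens brings that down to roughly 0.4 GiB, in line with the 90% memory saving quoted in the abstract.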


TCM-Ladder: A Benchmark for Multimodal Question Answering on Traditional Chinese Medicine

Abstract

arXiv:2505.24063v1 Announce Type: cross Abstract: Traditional Chinese Medicine (TCM), as an effective alternative medicine, has been receiving increasing attention. In recent years, the rapid development of large language models (LLMs) tailored for TCM has underscored the need for an objective and comprehensive evaluation framework to assess their performance on real-world tasks. However, existing evaluation datasets are limited in scope and primarily text-based, lacking a unified and standardized multimodal question-answering (QA) benchmark. To address this issue, we introduce TCM-Ladder, the first multimodal QA dataset specifically designed for evaluating large TCM language models. The dataset spans multiple core disciplines of TCM, including fundamental theory, diagnostics, herbal formulas, internal medicine, surgery, pharmacognosy, and pediatrics. In addition to textual content, TCM-Ladder incorporates various modalities such as images and videos. The datasets were constructed using a combination of automated and manual filtering processes and comprise 52,000+ questions in total. These questions include single-choice, multiple-choice, fill-in-the-blank, diagnostic dialogue, and visual comprehension tasks. We trained a reasoning model on TCM-Ladder and conducted comparative experiments against 9 state-of-the-art general domain and 5 leading TCM-specific LLMs to evaluate their performance on the datasets. Moreover, we propose Ladder-Score, an evaluation method specifically designed for TCM question answering that effectively assesses answer quality regarding terminology usage and semantic expression. To our knowledge, this is the first work to evaluate mainstream general domain and TCM-specific LLMs on a unified multimodal benchmark. The datasets and leaderboard are publicly available at https://tcmladder.com or https://54.211.107.106 and will be continuously updated.

摘要

作为有效的替代医学,传统中医药(TCM)正受到越来越多的关注。近年来,针对中医药领域定制的大语言模型(LLMs)快速发展,凸显了对客观全面评估框架的需求,以评估这些模型在真实任务中的表现。然而,现有评估数据集范围有限且主要为文本形式,缺乏统一标准化的多模态问答(QA)基准。为此,我们提出了TCM-Ladder,这是首个专门用于评估中医药大语言模型的多模态问答数据集。该数据集涵盖中医药多个核心学科,包括基础理论、诊断学、方剂学、内科学、外科学、生药学及儿科学。除文本内容外,TCM-Ladder还整合了图像、视频等多种模态。数据集通过自动化与人工筛选相结合的方式构建,共计包含52,000余道问题,题型涵盖单选题、多选题、填空题、诊断对话及视觉理解任务。我们在TCM-Ladder上训练了推理模型,并对9个前沿通用领域及5个领先的中医药专用LLMs进行了对比实验以评估其表现。此外,我们提出了Ladder-Score这一专为中医药问答设计的评估方法,可有效衡量答案在术语使用和语义表达方面的质量。据我们所知,这是首个在统一多模态基准上评估主流通用领域与中医药专用LLMs的研究。数据集与排行榜已公开于https://tcmladder.com或https://54.211.107.106,并将持续更新。


AMSbench: A Comprehensive Benchmark for Evaluating MLLM Capabilities in AMS Circuits

Abstract

arXiv:2505.24138v1 Announce Type: cross Abstract: Analog/Mixed-Signal (AMS) circuits play a critical role in the integrated circuit (IC) industry. However, automating AMS circuit design has remained a longstanding challenge due to its difficulty and complexity. Recent advances in Multi-modal Large Language Models (MLLMs) offer promising potential for supporting AMS circuit analysis and design. However, current research typically evaluates MLLMs on isolated tasks within the domain, lacking a comprehensive benchmark that systematically assesses model capabilities across diverse AMS-related challenges. To address this gap, we introduce AMSbench, a benchmark suite designed to evaluate MLLM performance across critical tasks including circuit schematic perception, circuit analysis, and circuit design. AMSbench comprises approximately 8000 test questions spanning multiple difficulty levels and assesses eight prominent models, encompassing both open-source and proprietary solutions such as Qwen 2.5-VL and Gemini 2.5 Pro. Our evaluation highlights significant limitations in current MLLMs, particularly in complex multi-modal reasoning and sophisticated circuit design tasks. These results underscore the necessity of advancing MLLMs' understanding and effective application of circuit-specific knowledge, thereby narrowing the existing performance gap relative to human expertise and moving toward fully automated AMS circuit design workflows. Our data is released at https://huggingface.co/datasets/wwhhyy/AMSBench.

摘要

模拟/混合信号(AMS)电路在集成电路(IC)产业中具有关键作用。然而,由于其难度和复杂性,实现模拟/混合信号电路设计的自动化始终是一项长期挑战。多模态大语言模型(MLLM)的最新进展为支持AMS电路分析与设计提供了潜在可能。但目前研究通常仅针对该领域孤立任务评估MLLM性能,缺乏能系统评估模型在多样化AMS相关挑战中综合能力的基准测试。为此,我们提出AMSbench基准测试套件,用于评估MLLM在电路原理图识别、电路分析和电路设计等关键任务中的表现。该套件包含约8000道涵盖多难度级别的测试题目,并对包括Qwen 2.5-VL和Gemini 2.5 Pro在内的8个主流开源与商业模型进行了评估。研究结果表明当前MLLM存在显著局限性,尤其在复杂多模态推理和精密电路设计任务中表现欠佳。这些发现强调了提升MLLM对电路专业知识的理解与应用能力的必要性,从而缩小现有模型与人类专家水平的性能差距,推动实现全自动化的AMS电路设计流程。相关数据已发布于https://huggingface.co/datasets/wwhhyy/AMSBench。


DSR-Bench: Evaluating the Structural Reasoning Abilities of LLMs via Data Structures

Abstract

arXiv:2505.24069v1 Announce Type: cross Abstract: Large language models (LLMs) are increasingly deployed for real-world tasks that fundamentally involve data manipulation. A core requirement across these tasks is the ability to perform structural reasoning--that is, to understand and reason about data relationships. For example, customer requests require a temporal ordering, which can be represented by data structures such as queues. However, existing benchmarks primarily focus on high-level, application-driven evaluations without isolating this fundamental capability. To address this gap, we introduce DSR-Bench, a novel benchmark evaluating LLMs' structural reasoning capabilities through data structures, which provide interpretable representations of data relationships. DSR-Bench includes 20 data structures, 35 operations, and 4,140 problem instances, organized hierarchically for fine-grained analysis of reasoning limitations. Our evaluation pipeline is fully automated and deterministic, eliminating subjective human or model-based judgments. Its synthetic nature also ensures scalability and minimizes data contamination risks. We benchmark nine state-of-the-art LLMs. Our analysis shows that instruction-tuned models struggle with basic multi-attribute and multi-hop reasoning. Furthermore, while reasoning-oriented models perform better, they remain fragile on complex and hybrid structures, with the best model achieving an average score of only 47% on the challenge subset. Crucially, models often perform poorly on multi-dimensional data and natural language task descriptions, highlighting a critical gap for real-world deployment.

摘要

大型语言模型(LLMs)正日益被部署于本质上涉及数据操作的实际任务中。这些任务的核心要求是具备结构推理能力——即理解并推断数据关系的能力。例如,客户请求需要时间排序,这可以通过队列等数据结构来表示。然而,现有基准测试主要关注高层次、应用驱动的评估,而未隔离这一基础能力。为填补这一空白,我们提出了DSR-Bench,这是一个通过数据结构评估LLMs结构推理能力的新型基准测试,数据结构为数据关系提供了可解释的表示形式。DSR-Bench包含20种数据结构、35种操作和4,140个问题实例,采用分层组织以实现对推理局限性的细粒度分析。我们的评估流程完全自动化且具有确定性,消除了主观的人为或基于模型的判断。其合成性质还确保了可扩展性并最小化了数据污染风险。我们对九种最先进的LLMs进行了基准测试。分析表明,经过指令调优的模型在基本多属性和多跳推理方面表现欠佳。此外,尽管面向推理的模型表现更好,但在复杂和混合结构上仍然脆弱,最佳模型在挑战子集上的平均得分仅为47%。关键的是,模型在多维数据和自然语言任务描述上往往表现不佳,这凸显了实际部署中的关键差距。


Don't Just Follow MLLM Plans: Robust and Efficient Planning for Open-world Agents

Abstract

arXiv:2505.24157v1 Announce Type: cross Abstract: Developing autonomous agents capable of mastering complex, multi-step tasks in unpredictable, interactive environments presents a significant challenge. While Large Language Models (LLMs) offer promise for planning, existing approaches often rely on problematic internal knowledge or make unrealistic environmental assumptions. Although recent work explores learning planning knowledge, it still retains limitations due to partial reliance on external knowledge or impractical setups. Indeed, prior research has largely overlooked developing agents capable of acquiring planning knowledge from scratch, directly in realistic settings. While realizing this capability is necessary, it presents significant challenges, primarily achieving robustness given the substantial risk of incorporating LLMs' inaccurate knowledge. Moreover, efficiency is crucial for practicality as learning can demand prohibitive exploration. In response, we introduce Robust and Efficient Planning for Open-world Agents (REPOA), a novel framework designed to tackle these issues. REPOA features three key components: adaptive dependency learning and fine-grained failure-aware operation memory to enhance robustness to knowledge inaccuracies, and difficulty-based exploration to improve learning efficiency. Our evaluation in two established open-world testbeds demonstrates REPOA's robust and efficient planning, showcasing its capability to successfully obtain challenging late-game items that were beyond the reach of prior approaches.

摘要

开发能够在不可预测的交互式环境中掌握复杂多步骤任务的自主代理系统是一项重大挑战。尽管大语言模型(LLMs)为任务规划提供了潜力,现有方法往往依赖于有问题的内部知识或做出不切实际的环境假设。虽然近期研究探索了规划知识学习,但由于部分依赖外部知识或不切实际的实验设置,这些方法仍存在局限。事实上,先前研究大多忽视了开发能够在真实场景中从零开始直接获取规划知识的智能体。实现这一能力虽然必要,却面临重大挑战——主要是在整合LLMs不准确知识的高风险下保持系统鲁棒性。此外,由于学习过程可能需要进行大量探索,效率问题对实际应用至关重要。为此,我们提出了开放世界智能体鲁棒高效规划框架(REPOA),该创新框架包含三大核心组件:通过自适应依赖学习和细粒度故障感知操作记忆来增强对知识不准确性的鲁棒性,以及基于难度的探索机制来提高学习效率。在两个成熟的开放世界测试平台上的评估表明,REPOA实现了鲁棒高效的规划能力,其成功获取高难度终局物品的表现超越了现有所有方法。


A Reward-driven Automated Webshell Malicious-code Generator for Red-teaming

Abstract

arXiv:2505.24252v1 Announce Type: cross Abstract: Frequent cyber-attacks have elevated WebShell exploitation and defense to a critical research focus within network security. However, there remains a significant shortage of publicly available, well-categorized malicious-code datasets organized by obfuscation method. Existing malicious-code generation methods, which primarily rely on prompt engineering, often suffer from limited diversity and high redundancy in the payloads they produce. To address these limitations, we propose RAWG, a Reward-driven Automated Webshell Malicious-code Generator designed for red-teaming applications. Our approach begins by categorizing webshell samples from common datasets into seven distinct types of obfuscation. We then employ a large language model (LLM) to extract and normalize key tokens from each sample, creating a standardized, high-quality corpus. Using this curated dataset, we perform supervised fine-tuning (SFT) on an open-source large model to enable the generation of diverse, highly obfuscated webshell malicious payloads. To further enhance generation quality, we apply Proximal Policy Optimization (PPO), treating malicious-code samples as "chosen" data and benign code as "rejected" data during reinforcement learning. Extensive experiments demonstrate that RAWG significantly outperforms current state-of-the-art methods in both payload diversity and escape effectiveness.


S4-Driver: Scalable Self-Supervised Driving Multimodal Large Language Model with Spatio-Temporal Visual Representation

Abstract

arXiv:2505.24139v1 Announce Type: cross Abstract: The latest advancements in multi-modal large language models (MLLMs) have spurred a strong renewed interest in end-to-end motion planning approaches for autonomous driving. Many end-to-end approaches rely on human annotations to learn intermediate perception and prediction tasks, while purely self-supervised approaches -- which directly learn from sensor inputs to generate planning trajectories without human annotations -- often underperform the state of the art. We observe a key gap in the input representation space: end-to-end approaches built on MLLMs are often pretrained with reasoning tasks in 2D image space rather than the native 3D space in which autonomous vehicles plan. To this end, we propose S4-Driver, a scalable self-supervised motion planning algorithm with spatio-temporal visual representation, based on the popular PaLI multimodal large language model. S4-Driver uses a novel sparse volume strategy to seamlessly transform the strong visual representation of MLLMs from perspective view to 3D space without the need to finetune the vision encoder. This representation aggregates multi-view and multi-frame visual inputs and enables better prediction of planning trajectories in 3D space. To validate our method, we run experiments on both nuScenes and Waymo Open Motion Dataset (with in-house camera data). Results show that S4-Driver performs favorably against existing supervised multi-task approaches while requiring no human annotations. It also demonstrates great scalability when pretrained on large volumes of unannotated driving logs.

摘要

多模态大语言模型(MLLMs)的最新进展重新激发了人们对自动驾驶端到端运动规划方法的强烈兴趣。现有端到端方法多依赖人工标注数据来学习中间感知与预测任务,而纯自监督方法——即直接从传感器输入生成规划轨迹且无需人工标注的方案——其性能往往落后于当前最优水平。我们发现在输入表征空间存在关键差异:基于MLLMs构建的端到端方法通常在二维图像空间进行推理任务预训练,而非自动驾驶车辆实际规划的原始三维空间。为此,我们提出S4-Driver算法,这是一种基于流行PaLI多模态大语言模型、具有时空视觉表征的可扩展自监督运动规划方法。该算法采用创新的稀疏体素策略,无需微调视觉编码器即可将MLLMs强大的视觉表征从透视图无缝转换至三维空间。该表征聚合多视角多帧视觉输入,能更精准预测三维空间的规划轨迹。我们在nuScenes和Waymo开放运动数据集(含内部摄像头数据)上的实验表明:S4-Driver在无需人工标注的情况下,性能优于现有监督式多任务方法,并展现出在大规模无标注驾驶日志上预训练时的卓越扩展性。


Fine-Tune an SLM or Prompt an LLM? The Case of Generating Low-Code Workflows

Abstract

arXiv:2505.24189v1 Announce Type: cross Abstract: Large Language Models (LLMs) such as GPT-4o can handle a wide range of complex tasks with the right prompt. As per-token costs fall, the advantages of fine-tuning Small Language Models (SLMs) for real-world applications -- faster inference, lower costs -- may no longer be clear. In this work, we present evidence that, for domain-specific tasks that require structured outputs, SLMs still have a quality advantage. We compare fine-tuning an SLM against prompting LLMs on the task of generating low-code workflows in JSON form. We observe that while a good prompt can yield reasonable results, fine-tuning improves quality by 10% on average. We also perform systematic error analysis to reveal model limitations.

摘要

诸如GPT-4o之类的大型语言模型(LLMs)在适当提示下能够处理多种复杂任务。随着单令牌成本降低,为实际应用微调小型语言模型(SLMs)的优势——更快的推理速度、更低廉的成本——可能不再显著。本研究表明,对于需要结构化输出的特定领域任务,SLMs仍具有质量优势。我们比较了在生成JSON格式低代码工作流的任务中微调SLM与提示LLM的效果,发现虽然优质提示可获得合理结果,但微调平均能提升10%的质量。我们还通过系统性错误分析揭示了模型的局限性。


SALE: Low-bit Estimation for Efficient Sparse Attention in Long-context LLM Prefilling

Abstract

arXiv:2505.24179v1 Announce Type: cross Abstract: Many advanced Large Language Model (LLM) applications require long-context processing, but the self-attention module becomes a bottleneck during the prefilling stage of inference due to its quadratic time complexity with respect to sequence length. Existing sparse attention methods accelerate attention computation by skipping less significant regions of the attention map. However, these approaches typically perform coarse-grained inspection of the attention map, incurring considerable loss in model accuracy. In this paper, we propose SALE, a fine-grained sparse attention method that accelerates the long-context prefilling stage of LLMs with negligible loss in model accuracy. SALE achieves fast and accurate fine-grained attention weight estimation through 4-bit quantized query-key products, followed by block-sparse attention to accelerate prefilling computations. To evaluate the importance of query-key pairs, we adopt our Relative Attention Score metric, which offers significantly higher efficiency within our framework. We implement a custom CUDA kernel optimized for our approach for hardware efficiency, reducing the additional overhead to approximately 11% of the full attention latency. Notably, SALE requires no parameter training and can be seamlessly integrated into existing systems with trivial code modifications. Experiments on long-context benchmarks demonstrate that our method outperforms existing approaches in accuracy-efficiency trade-offs, achieving at least 3.36x speedups on Llama-3.1-8B for sequences longer than 64K while maintaining model quality.

摘要

许多先进的大语言模型(LLM)应用需要长上下文处理,但由于自注意力模块在推理预填充阶段的时间复杂度与序列长度呈二次方关系,其成为性能瓶颈。现有稀疏注意力方法通过跳过注意力图中较不重要的区域来加速计算,但这些方法通常对注意力图进行粗粒度检测,导致模型准确性显著下降。本文提出SALE,一种细粒度稀疏注意力方法,能在模型精度损失可忽略的前提下加速LLM的长上下文预填充阶段。SALE通过4位量化的查询-键乘积实现快速精确的细粒度注意力权重估计,再采用块稀疏注意力加速预填充计算。针对查询-键对的重要性评估,我们采用相对注意力分数指标,该指标在我们的框架内具有显著更高的效率。我们实现了专为该方案优化的定制CUDA内核,将额外开销降至完整注意力延迟的约11%。值得注意的是,SALE无需参数训练,仅需极少量代码修改即可无缝集成到现有系统中。长上下文基准测试表明,本方法在精度-效率权衡上优于现有方案,在Llama-3.1-8B模型上对超过64K长度的序列实现至少3.36倍加速,同时保持模型质量。
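
The estimate-then-skip idea in the SALE abstract can be illustrated with a small pure-Python sketch. It quantizes queries and keys to 4-bit integer codes, estimates query-key products from the codes alone, and keeps only the tiles whose estimated score is close to the best tile for that query block. The function names, the symmetric per-vector quantizer, and the `rel_threshold` rule are assumptions made for illustration; the abstract does not spell out the Relative Attention Score metric or the CUDA kernel, so this is not the authors' implementation.

```python
def quantize_int4(vec):
    """Symmetric per-vector quantization to integer codes in [-7, 7] plus a scale."""
    max_abs = max(abs(x) for x in vec) or 1.0
    scale = max_abs / 7.0
    codes = [max(-7, min(7, round(x / scale))) for x in vec]
    return codes, scale


def approx_scores(queries, keys):
    """Estimate q.k for every (query, key) pair using only the low-bit codes."""
    q_codes = [quantize_int4(q) for q in queries]
    k_codes = [quantize_int4(k) for k in keys]
    return [[sq * sk * sum(a * b for a, b in zip(qc, kc))
             for kc, sk in k_codes]
            for qc, sq in q_codes]


def select_blocks(scores, block, rel_threshold):
    """Return the (query_block, key_block) tiles to compute exactly.

    Hypothetical rule: a tile is kept when its largest estimated score is
    within rel_threshold of the largest score in the same query block.
    """
    n_q, n_k = len(scores), len(scores[0])
    keep = set()
    for qb in range(0, n_q, block):
        tile_max = {}
        for kb in range(0, n_k, block):
            tile_max[kb] = max(scores[i][j]
                               for i in range(qb, min(qb + block, n_q))
                               for j in range(kb, min(kb + block, n_k)))
        row_max = max(tile_max.values())
        for kb, value in tile_max.items():
            if value >= row_max - rel_threshold:
                keep.add((qb // block, kb // block))
    return keep
```

Because softmax depends on score differences, comparing each tile against the maximum of its query block is one plausible way to decide which tiles contribute negligibly; exact attention would then run only on the returned tiles.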


LKD-KGC: Domain-Specific KG Construction via LLM-driven Knowledge Dependency Parsing

Abstract

arXiv:2505.24163v1 Announce Type: cross Abstract: Knowledge Graphs (KGs) structure real-world entities and their relationships into triples, enhancing machine reasoning for various tasks. While domain-specific KGs offer substantial benefits, their manual construction is often inefficient and requires specialized knowledge. Recent approaches for knowledge graph construction (KGC) based on large language models (LLMs), such as schema-guided KGC and reference knowledge integration, have proven efficient. However, these methods are constrained by their reliance on manually defined schemas, single-document processing, and public-domain references, making them less effective for domain-specific corpora that exhibit complex knowledge dependencies and specificity, as well as limited reference knowledge. To address these challenges, we propose LKD-KGC, a novel framework for unsupervised domain-specific KG construction. LKD-KGC autonomously analyzes document repositories to infer knowledge dependencies, determines optimal processing sequences via LLM-driven prioritization, and autoregressively generates entity schemas by integrating hierarchical inter-document contexts. These schemas guide the unsupervised extraction of entities and relationships, eliminating reliance on predefined structures or external knowledge. Extensive experiments show that compared with state-of-the-art baselines, LKD-KGC generally achieves improvements of 10% to 20% in both precision and recall, demonstrating its potential in constructing high-quality domain-specific KGs.

摘要

知识图谱(KGs)将现实世界中的实体及其关系组织为三元组,从而增强机器在各类任务中的推理能力。尽管领域专用知识图谱具有显著优势,但其人工构建过程通常效率低下且需要专业知识。当前基于大语言模型(LLM)的知识图谱构建方法(如模式引导构建和参考知识集成)已被证明具有高效性,但这些方法受限于对人工定义模式的依赖、单文档处理能力以及公共领域参考知识,导致其在处理具有复杂知识依赖性与领域特异性、且参考知识有限的领域专用语料时效果欠佳。为解决这些问题,我们提出LKD-KGC——一种无监督领域专用知识图谱构建新框架。该框架通过自主分析文档库推断知识依赖关系,利用LLM驱动的优先级排序确定最优处理序列,并通过集成层次化文档间上下文自回归生成实体模式。该模式可指导实体与关系的无监督抽取,无需依赖预定义结构或外部知识。大量实验表明,相较于最先进的基线方法,LKD-KGC在精确率与召回率上普遍实现10%至20%的提升,展现了其在构建高质量领域专用知识图谱方面的潜力。


Reasoning Can Hurt the Inductive Abilities of Large Language Models

Abstract

arXiv:2505.24225v1 Announce Type: cross Abstract: Large Language Models (LLMs) have shown remarkable progress across domains, yet their ability to perform inductive reasoning - inferring latent rules from sparse examples - remains limited. It is often assumed that chain-of-thought (CoT) prompting, as used in Large Reasoning Models (LRMs), enhances such reasoning. We investigate this assumption by creating four controlled, diagnostic game-based tasks - chess, Texas Hold'em, dice games, and blackjack - with hidden human-defined rules. We find that CoT reasoning can degrade inductive performance, with LRMs often underperforming their non-reasoning counterparts. To explain this, we present a theoretical framework that reveals how reasoning steps can amplify error through three failure modes: incorrect sub-task decomposition, incorrect sub-task solving, and incorrect final answer summarization. Based on our theoretical and empirical analysis, we introduce structured interventions that adapt CoT generation according to our identified failure types. These interventions improve inductive accuracy without retraining. Our findings suggest that effective (CoT) reasoning depends not only on taking more steps but also on ensuring those steps are well-structured.

摘要

尽管大语言模型(LLMs)在各领域展现出显著进展,但其执行归纳推理(从稀疏示例中推断潜在规则)的能力仍然有限。通常认为,大型推理模型(LRMs)采用的思维链(CoT)提示能增强此类推理。我们通过创建四个受控的诊断性游戏任务(国际象棋、德州扑克、骰子游戏和二十一点)来验证这一假设,这些任务包含人类定义的隐藏规则。研究发现,CoT推理可能削弱归纳性能,LRMs的表现常逊于非推理模型。

为解释这一现象,我们提出一个理论框架,揭示推理步骤如何通过三种失败模式放大误差:错误的子任务分解、错误的子任务求解以及错误的最终答案汇总。基于理论和实证分析,我们引入结构化干预措施,根据已识别的失败类型调整CoT生成。这些干预措施无需重新训练即可提升归纳准确率。研究结果表明,有效的(CoT)推理不仅取决于步骤数量,更依赖于步骤结构的合理性。


Seeing is Not Reasoning: MVPBench for Graph-based Evaluation of Multi-path Visual Physical CoT

Abstract

arXiv:2505.24182v1 Announce Type: cross Abstract: Understanding the physical world - governed by laws of motion, spatial relations, and causality - poses a fundamental challenge for multimodal large language models (MLLMs). While recent advances such as OpenAI o3 and GPT-4o demonstrate impressive perceptual and reasoning capabilities, our investigation reveals these models struggle profoundly with visual physical reasoning, failing to grasp basic physical laws, spatial interactions, and causal effects in complex scenes. More importantly, they often fail to follow coherent reasoning chains grounded in visual evidence, especially when multiple steps are needed to arrive at the correct answer. To rigorously evaluate this capability, we introduce MVPBench, a curated benchmark that assesses visual physical reasoning through the lens of visual chain-of-thought (CoT). Each example features interleaved multi-image inputs and demands not only the correct final answer but also a coherent, step-by-step reasoning path grounded in evolving visual cues. This setup mirrors how humans reason through real-world physical processes over time. To ensure fine-grained evaluation, we introduce a graph-based CoT consistency metric that verifies whether a model's reasoning path adheres to valid physical logic. Additionally, we minimize shortcut exploitation from text priors, encouraging models to rely on visual understanding. Experimental results reveal a concerning trend: even cutting-edge MLLMs exhibit poor visual reasoning accuracy and weak image-text alignment in physical domains. Surprisingly, RL-based post-training alignment - commonly believed to improve visual reasoning performance - often harms spatial reasoning, suggesting a need to rethink current fine-tuning practices.

摘要

理解由运动定律、空间关系和因果性支配的物理世界,对多模态大语言模型(MLLMs)构成了根本性挑战。尽管OpenAI o3和GPT-4o等最新进展展现了令人印象深刻的感知与推理能力,但我们的研究表明,这些模型在视觉物理推理方面存在严重缺陷:它们难以把握复杂场景中的基本物理定律、空间交互和因果效应。更重要的是,当需要多步推理才能得出正确答案时,这些模型往往无法遵循基于视觉证据的连贯推理链条。为系统评估该能力,我们提出了MVPBench——一个通过视觉思维链(CoT)框架严格评估视觉物理推理的精选基准。每个测试样例均包含交错的多图像输入,不仅要求最终答案正确,还需提供基于动态视觉线索的连贯分步推理路径,这种设置模拟了人类在真实物理过程中随时间推进的推理方式。为确保细粒度评估,我们提出基于图的CoT一致性度量标准,用于验证模型推理路径是否符合有效物理逻辑。此外,我们最大限度减少了文本先验的捷径利用,促使模型依赖视觉理解。实验结果揭示了令人担忧的趋势:即使最先进的MLLMs在物理领域的视觉推理准确率和图文对齐能力也表现不佳。令人意外的是,基于强化学习的训练后对齐(通常被认为能提升视觉推理性能)往往会损害空间推理能力,这表明需要重新审视当前微调实践。


From Hallucinations to Jailbreaks: Rethinking the Vulnerability of Large Foundation Models

Abstract

arXiv:2505.24232v1 Announce Type: cross Abstract: Large foundation models (LFMs) are susceptible to two distinct vulnerabilities: hallucinations and jailbreak attacks. While typically studied in isolation, we observe that defenses targeting one often affect the other, hinting at a deeper connection. We propose a unified theoretical framework that models jailbreaks as token-level optimization and hallucinations as attention-level optimization. Within this framework, we establish two key propositions: (1) Similar Loss Convergence - the loss functions for both vulnerabilities converge similarly when optimizing for target-specific outputs; and (2) Gradient Consistency in Attention Redistribution - both exhibit consistent gradient behavior driven by shared attention dynamics. We validate these propositions empirically on LLaVA-1.5 and MiniGPT-4, showing consistent optimization trends and aligned gradients. Leveraging this connection, we demonstrate that mitigation techniques for hallucinations can reduce jailbreak success rates, and vice versa. Our findings reveal a shared failure mode in LFMs and suggest that robustness strategies should jointly address both vulnerabilities.

摘要

大型基础模型(LFMs)存在两种显著缺陷:幻觉和越狱攻击。尽管通常被独立研究,但我们发现针对其中一种缺陷的防御措施往往会影响另一种,这暗示着两者存在深层关联。我们提出一个统一的理论框架,将越狱攻击建模为令牌级优化,将幻觉建模为注意力级优化。在该框架中,我们确立了两个关键命题:(1)相似损失收敛——当针对特定目标输出进行优化时,两种缺陷的损失函数呈现相似收敛特性;(2)注意力重分配的梯度一致性——两者均表现出由共享注意力动态驱动的一致性梯度行为。我们在LLaVA-1.5和MiniGPT-4上实证验证了这些命题,显示出一致的优化趋势和梯度对齐。利用这种关联性,我们证明缓解幻觉的技术可降低越狱成功率,反之亦然。本研究揭示了LFMs的共享故障模式,并表明鲁棒性策略应协同应对这两种缺陷。


Effects of Theory of Mind and Prosocial Beliefs on Steering Human-Aligned Behaviors of LLMs in Ultimatum Games

Abstract

arXiv:2505.24255v1 Announce Type: cross Abstract: Large Language Models (LLMs) have shown potential in simulating human behaviors and performing theory-of-mind (ToM) reasoning, a crucial skill for complex social interactions. In this study, we investigate the role of ToM reasoning in aligning agentic behaviors with human norms in negotiation tasks, using the ultimatum game as a controlled environment. We initialized LLM agents with different prosocial beliefs (including Greedy, Fair, and Selfless) and reasoning methods like chain-of-thought (CoT) and varying ToM levels, and examined their decision-making processes across diverse LLMs, including reasoning models like o3-mini and DeepSeek-R1 Distilled Qwen 32B. Results from 2,700 simulations indicated that ToM reasoning enhances behavior alignment, decision-making consistency, and negotiation outcomes. Consistent with previous findings, reasoning models exhibit limited capability compared to models with ToM reasoning, and different roles in the game benefit from different orders of ToM reasoning. Our findings contribute to the understanding of ToM's role in enhancing human-AI interaction and cooperative decision-making. The code used for our experiments can be found at https://github.com/Stealth-py/UltimatumToM.

摘要

大语言模型(LLMs)在模拟人类行为和执行心智理论(ToM)推理方面展现出潜力,这种能力对复杂社会互动至关重要。本研究以最后通牒游戏为受控环境,探讨ToM推理在谈判任务中使智能体行为与人类规范对齐的作用。我们为LLM智能体设定了不同亲社会信念(包括贪婪型、公平型和无私型)及思维链(CoT)等推理方法,并采用不同ToM层级,在包括o3-mini和DeepSeek-R1 Distilled Qwen 32B等推理模型在内的多种LLMs中检验其决策过程。2,700次模拟实验结果表明,ToM推理能提升行为对齐性、决策一致性及谈判结果。与先前研究一致,纯推理模型相比具备ToM推理的模型表现有限,且不同ToM推理顺序对游戏收益分配产生差异化影响。本发现为理解ToM在促进人机交互与合作决策中的作用提供了新见解。实验代码详见https://github.com/Stealth-py/UltimatumToM。


Large Language Models are Locally Linear Mappings

Abstract

arXiv:2505.24293v1 Announce Type: cross Abstract: We demonstrate that the inference operations of several open-weight large language models (LLMs) can be mapped to an exactly equivalent linear system for an input sequence without modifying the model weights or altering output predictions. Extending techniques from image diffusion models that exhibit local or piecewise linearity, we strategically alter the gradient computation with respect to a given input sequence for a next-token prediction such that the Jacobian of the model nearly exactly reproduces the forward prediction with a linear system. We demonstrate this approach across models (Llama 3, Gemma 3, Qwen 3, Phi 4, Mistral Ministral and OLMo 2, up to Llama 3.3 70B Q4) and show through the singular value decomposition of the detached Jacobian that these LLMs operate in extremely low-dimensional subspaces where many of the largest singular vectors decode to concepts related to the most-likely output token. This approach also allows us to examine the operation of each successive layer (and its attention and MLP components) as nearly-exact linear systems and observe the emergence of semantic concepts. Despite their expressive power and global nonlinearity, modern LLMs can be interpreted through nearly-exact locally linear decompositions that provide insights into their internal representations and reveal interpretable semantic structures in the next-token prediction process.

摘要

我们证明,在不修改模型权重或改变输出预测的情况下,可将多个开源权重大型语言模型(LLM)的推理操作映射为输入序列的完全等效线性系统。通过扩展图像扩散模型中表现出的局部或分段线性技术,我们策略性地改变了给定输入序列在下一词元预测中的梯度计算,使得模型的雅可比矩阵几乎完全复现了线性系统的前向预测。我们在多个模型(Llama 3、Gemma 3、Qwen 3、Phi 4、Mistral Ministral及OLMo 2,最高至Llama 3.3 70B Q4)上验证了该方法,并通过分离雅可比矩阵的奇异值分解表明:这些LLM在极低维子空间中运行,其中多数最大奇异向量可解码为与最可能输出词元相关的概念。该方法还使我们能够将每一连续层(及其注意力与MLP组件)的操作视为近乎精确的线性系统,并观察语义概念的涌现现象。尽管现代LLM具有强大的表达能力和全局非线性特征,但通过近乎精确的局部线性分解仍可对其解释,这为理解其内部表征提供了新视角,并揭示了下一词元预测过程中可解释的语义结构。
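
A toy case shows why a Jacobian can reproduce a forward pass exactly. For a bias-free ReLU network, freezing the activation pattern at a given input leaves a plain matrix product, so the output equals that matrix times the input. The sketch below uses a two-layer network in pure Python; it is an illustrative assumption, not the paper's procedure, which per the abstract alters gradient computation in full transformer LLMs.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def forward(w1, w2, x):
    """Bias-free two-layer ReLU network: y = W2 · relu(W1 · x)."""
    hidden = [max(0.0, h) for h in matvec(w1, x)]
    return matvec(w2, hidden)


def detached_jacobian(w1, w2, x):
    """J = W2 · diag(mask) · W1, with the ReLU on/off mask frozen at input x."""
    mask = [1.0 if h > 0 else 0.0 for h in matvec(w1, x)]
    return [[sum(w2[o][h] * mask[h] * w1[h][i] for h in range(len(w1)))
             for i in range(len(x))]
            for o in range(len(w2))]


w1 = [[1.0, -1.0], [2.0, 1.0], [-1.0, 0.5]]
w2 = [[1.0, 2.0, 3.0], [0.0, -1.0, 1.0]]
x = [1.0, 2.0]
jacobian = detached_jacobian(w1, w2, x)
# For this input the linear system J · x matches the nonlinear forward pass.
linear_prediction = matvec(jacobian, x)
```

The equality holds only at the input where the mask was computed; a different input can flip the mask and yield a different matrix, which is the sense in which the mapping is local.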


Faithful and Robust LLM-Driven Theorem Proving for NLI Explanations

Abstract

arXiv:2505.24264v1 Announce Type: cross Abstract: Natural language explanations play a fundamental role in Natural Language Inference (NLI) by revealing how premises logically entail hypotheses. Recent work has shown that the interaction of large language models (LLMs) with theorem provers (TPs) can help verify and improve the validity of NLI explanations. However, TPs require translating natural language into machine-verifiable formal representations, a process that introduces the risk of semantic information loss and unfaithful interpretation, an issue compounded by LLMs' challenges in capturing critical logical structures with sufficient precision. Moreover, LLMs are still limited in their capacity for rigorous and robust proof construction within formal verification frameworks. To mitigate issues related to faithfulness and robustness, this paper investigates strategies to (1) alleviate semantic loss during autoformalisation, (2) efficiently identify and correct syntactic errors in logical representations, (3) explicitly use logical expressions to guide LLMs in generating structured proof sketches, and (4) increase LLMs' capacity to interpret TP feedback for iterative refinement. Our empirical results on e-SNLI, QASC and WorldTree using different LLMs demonstrate that the proposed strategies yield significant improvements in autoformalisation (+18.46%, +34.2%, +39.77%) and explanation refinement (+29.5%, +51.5%, +41.25%) over the state-of-the-art model. Moreover, we show that specific interventions on the hybrid LLM-TP architecture can substantially improve efficiency, drastically reducing the number of iterations required for successful verification.

摘要

自然语言解释通过揭示前提如何逻辑蕴含假设,在自然语言推理(NLI)中发挥着基础性作用。近期研究表明,大型语言模型(LLMs)与定理证明器(TPs)的交互有助于验证和改进NLI解释的有效性。然而,TPs需要将自然语言转换为机器可验证的形式化表示,这一过程可能导致语义信息丢失和解释失真,而LLMs在精确捕捉关键逻辑结构方面的不足进一步加剧了该问题。此外,LLMs在形式化验证框架内构建严格且稳健证明的能力仍存在局限。为缓解忠实性与鲁棒性问题,本文研究以下策略:(1)减轻自动形式化过程中的语义损失;(2)高效识别并修正逻辑表示中的句法错误;(3)显式利用逻辑表达式引导LLMs生成结构化证明草图;(4)增强LLMs解析TP反馈以实现迭代优化的能力。基于e-SNLI、QASC和WorldTree数据集的实验表明,采用不同LLMs时,所提策略在自动形式化(+18.46%、+34.2%、+39.77%)和解释优化(+29.5%、+51.5%、+41.25%)方面较现有最优模型均有显著提升。此外,我们发现对LLM-TP混合架构的特定干预能大幅提升效率,使成功验证所需的迭代次数急剧减少。


Exploring Multimodal Challenges in Toxic Chinese Detection: Taxonomy, Benchmark, and Findings

Abstract

arXiv:2505.24341v1 Announce Type: cross Abstract: Detecting toxic content using language models is important but challenging. While large language models (LLMs) have demonstrated strong performance in understanding Chinese, recent studies show that simple character substitutions in toxic Chinese text can easily confuse the state-of-the-art (SOTA) LLMs. In this paper, we highlight the multimodal nature of the Chinese language as a key challenge for deploying LLMs in toxic Chinese detection. First, we propose a taxonomy of 3 perturbation strategies and 8 specific approaches in toxic Chinese content. Then, we curate a dataset based on this taxonomy, and benchmark 9 SOTA LLMs (from both the US and China) to assess if they can detect perturbed toxic Chinese text. Additionally, we explore cost-effective enhancement solutions like in-context learning (ICL) and supervised fine-tuning (SFT). Our results reveal two important findings. (1) LLMs are less capable of detecting perturbed multimodal Chinese toxic content. (2) ICL or SFT with a small number of perturbed examples may cause the LLMs to "overcorrect": they misidentify much normal Chinese content as toxic.

摘要

检测有害内容对语言模型而言至关重要却也充满挑战。尽管大语言模型(LLM)在中文理解方面展现出强大性能,但最新研究表明,中文有害文本中简单的字符替换就能轻易干扰最先进的大语言模型。本文指出汉语的多模态特性是部署大语言模型进行中文有害内容检测的核心挑战。首先,我们提出了中文有害内容的三类扰动策略和八种具体方法。基于此分类体系,我们构建了一个数据集,并对9个中美最先进的大语言模型进行基准测试,评估其检测扰动中文有害文本的能力。此外,我们还探索了上下文学习(ICL)和监督微调(SFT)等经济高效的增强方案。研究结果揭示了两项重要发现:(1)大语言模型对多模态中文有害内容的检测能力较弱;(2)使用少量扰动样本进行ICL或SFT可能导致模型"过度修正":将大量正常中文内容误判为有害。


ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration

Abstract

arXiv:2505.24357v1 Announce Type: cross Abstract: Large language models (LLMs) have achieved remarkable performance, yet their capability on long-context reasoning is often constrained by the excessive memory required to store the Key-Value (KV) cache. This makes KV cache compression an essential step toward enabling efficient long-context reasoning. Recent methods have explored reducing the hidden dimensions of the KV cache, but many introduce additional computation through projection layers or suffer from significant performance degradation under high compression ratios. To address these challenges, we propose ReCalKV, a post-training KV cache compression method that reduces the hidden dimensions of the KV cache. We develop distinct compression strategies for Keys and Values based on their different roles and varying importance in the attention mechanism. For Keys, we propose Head-wise Similarity-aware Reordering (HSR), which clusters similar heads and applies grouped SVD to the key projection matrix, reducing additional computation while preserving accuracy. For Values, we propose Offline Calibration and Matrix Fusion (OCMF) to preserve accuracy without extra computational overhead. Experiments show that ReCalKV outperforms existing low-rank compression methods, achieving high compression ratios with minimal performance loss. Code is available at: https://github.com/XIANGLONGYAN/ReCalKV.

摘要

大语言模型(LLMs)已展现出卓越性能,但其长上下文推理能力常受限于存储键值(KV)缓存所需的高内存开销。因此,KV缓存压缩成为实现高效长上下文推理的关键步骤。现有方法多尝试降低KV缓存的隐藏维度,但往往通过投影层引入额外计算,或在高压缩比下导致显著性能下降。针对这些挑战,我们提出ReCalKV——一种训练后KV缓存压缩方法,通过降低KV缓存的隐藏维度实现压缩。基于键和值在注意力机制中的不同作用及重要性差异,我们设计了差异化压缩策略:对于键,提出头部相似性感知重排序(HSR),通过聚类相似注意力头并对键投影矩阵实施分组奇异值分解(SVD),在减少额外计算的同时保持精度;对于值,采用离线校准与矩阵融合(OCMF)策略,无需额外计算开销即可维持精度。实验表明,ReCalKV在高压缩比下以最小性能损失优于现有低秩压缩方法。代码详见:https://github.com/XIANGLONGYAN/ReCalKV。


AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning

Abstract

arXiv:2505.24298v1 Announce Type: cross Abstract: Reinforcement learning (RL) has become a trending paradigm for training large language models (LLMs), particularly for reasoning tasks. Effective RL for LLMs requires massive parallelization and poses an urgent need for efficient training systems. Most existing large-scale RL systems for LLMs are synchronous by alternating generation and training in a batch setting, where the rollouts in each training batch are generated by the same (or latest) model. This stabilizes RL training but suffers from severe system-level inefficiency. Generation must wait until the longest output in the batch is completed before model update, resulting in GPU underutilization. We present AReaL, a fully asynchronous RL system that completely decouples generation from training. Rollout workers in AReaL continuously generate new outputs without waiting, while training workers update the model whenever a batch of data is collected. AReaL also incorporates a collection of system-level optimizations, leading to substantially higher GPU utilization. To stabilize RL training, AReaL balances the workload of rollout and training workers to control data staleness, and adopts a staleness-enhanced PPO variant to better handle outdated training samples. Extensive experiments on math and code reasoning benchmarks show that AReaL achieves up to 2.57× training speedup compared to the best synchronous systems with the same number of GPUs and matched or even improved final performance. The code of AReaL is available at https://github.com/inclusionAI/AReaL/.

摘要

强化学习(RL)已成为训练大语言模型(LLMs)的主流范式,尤其在推理任务中。针对LLMs的有效强化学习需要大规模并行化,因此亟需高效的训练系统。现有大多数面向LLMs的大规模RL系统采用同步方式,在批处理设置中交替进行生成和训练,其中每个训练批次的轨迹均由相同(或最新)模型生成。这种方法虽能稳定RL训练,但存在严重的系统级低效问题——生成阶段必须等待批次中最长输出完成后才能进行模型更新,导致GPU利用率不足。我们提出AReaL系统,这是一种完全异步的RL架构,彻底解耦了生成与训练过程。AReaL中的轨迹生成工作器无需等待即可持续产生新输出,而训练工作器在收集到批量数据后立即更新模型。该系统还集成了一系列系统级优化方案,显著提升了GPU利用率。为稳定RL训练,AReaL通过平衡生成与训练工作器的负载来控制数据陈旧度,并采用改进的PPO算法变体以更好地处理过时训练样本。在数学和代码推理基准测试上的大量实验表明,在GPU数量相同的情况下,AReaL相比最佳同步系统可实现高达2.57倍的训练加速,同时保持相当甚至更优的最终性能。AReaL代码已开源:https://github.com/inclusionAI/AReaL/。
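
One way to picture staleness control in an asynchronous setup is a deterministic simulation in which the rollout side is paused whenever it would run too far ahead of the trainer. The sketch below is a simplified assumption, not AReaL's scheduler: the admission rule, the `train_period` parameter, and the single-threaded loop are all hypothetical, and the staleness-aware PPO variant is not modeled.

```python
def simulate(ticks, batch_size, max_staleness, train_period):
    """Simulate decoupled rollout generation and training.

    Returns (policy_version, worst_lag), where worst_lag is the largest gap
    between the policy version at training time and the version that
    generated a sample in the consumed batch.
    """
    version = 0      # number of policy updates performed so far
    started = 0      # rollouts admitted so far
    buffer = []      # policy version that generated each finished rollout
    worst_lag = 0
    for tick in range(ticks):
        # Admission control (assumed rule): generate only if the batch this
        # rollout belongs to is at most max_staleness updates ahead.
        if started // batch_size <= version + max_staleness:
            buffer.append(version)
            started += 1
        # The trainer is slower than generation: one update per train_period ticks.
        if tick % train_period == train_period - 1 and len(buffer) >= batch_size:
            batch, buffer = buffer[:batch_size], buffer[batch_size:]
            worst_lag = max(worst_lag, version - min(batch))
            version += 1
    return version, worst_lag
```

With first-in-first-out batching, this admission rule bounds the lag of every consumed sample by `max_staleness`, at the cost of occasionally idling the generator; setting `max_staleness` to 0 recovers on-policy behavior.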


Grid-LOGAT: Grid Based Local and Global Area Transcription for Video Question Answering

Abstract

arXiv:2505.24371v1 Announce Type: cross Abstract: In this paper, we propose a Grid-based Local and Global Area Transcription (Grid-LoGAT) system for Video Question Answering (VideoQA). The system operates in two phases. First, it extracts text transcripts from video frames using a Vision-Language Model (VLM). Next, it processes questions using these transcripts to generate answers through a Large Language Model (LLM). This design ensures image privacy by deploying the VLM on edge devices and the LLM in the cloud. To improve transcript quality, we propose grid-based visual prompting, which extracts intricate local details from each grid cell and integrates them with global information. Evaluation results show that Grid-LoGAT, using the open-source VLM (LLaVA-1.6-7B) and LLM (Llama-3.1-8B), outperforms state-of-the-art methods with similar baseline models on NExT-QA and STAR-QA datasets with an accuracy of 65.9% and 50.11% respectively. Additionally, our method surpasses the non-grid version by 24 points on localization-based questions we created using NExT-QA.

摘要

本文提出了一种基于网格的局部与全局区域转录系统(Grid-LoGAT)用于视频问答任务。该系统采用两阶段处理流程:首先通过视觉语言模型(VLM)从视频帧中提取文本转录信息,随后利用这些转录内容通过大语言模型(LLM)处理问题并生成答案。该设计通过将VLM部署在边缘设备、LLM部署在云端的方式确保图像隐私。为提高转录质量,我们提出了基于网格的视觉提示方法,从每个网格单元提取精细的局部细节并与全局信息整合。评估结果表明,采用开源VLM(LLaVA-1.6-7B)和LLM(Llama-3.1-8B)的Grid-LoGAT系统在NExT-QA和STAR-QA数据集上分别达到65.9%和50.11%的准确率,优于使用相似基线模型的现有最优方法。此外,在我们基于NExT-QA构建的定位类问题上,本方法较非网格版本实现了24分的性能提升。
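
The grid-based prompting step can be sketched as splitting a frame into cells and collecting one description per cell alongside a whole-frame description. In the sketch below, `describe` stands in for a VLM call and is purely hypothetical, as are the function names and the transcript layout; the abstract does not give the prompt format or grid size.

```python
def grid_cells(width, height, rows, cols):
    """Return (left, top, right, bottom) boxes covering the frame, row by row."""
    boxes = []
    for r in range(rows):
        for c in range(cols):
            boxes.append((c * width // cols, r * height // rows,
                          (c + 1) * width // cols, (r + 1) * height // rows))
    return boxes


def transcribe_frame(frame_size, rows, cols, describe):
    """Build a text transcript from a global view plus every grid cell.

    `describe` is a placeholder callable mapping a box to a text description;
    in a real system it would crop the frame and query a VLM on the edge device.
    """
    width, height = frame_size
    lines = ["Global: " + describe((0, 0, width, height))]
    for index, box in enumerate(grid_cells(width, height, rows, cols)):
        lines.append(f"Cell {index}: " + describe(box))
    return "\n".join(lines)
```

Only the returned text would be sent to the cloud-hosted LLM, which is how the two-phase design keeps the image itself on the device.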


Breaking the Gold Standard: Extracting Forgotten Data under Exact Unlearning in Large Language Models

Abstract

arXiv:2505.24379v1 Announce Type: cross Abstract: Large language models are typically trained on datasets collected from the web, which may inadvertently contain harmful or sensitive personal information. To address growing privacy concerns, unlearning methods have been proposed to remove the influence of specific data from trained models. Of these, exact unlearning -- which retrains the model from scratch without the target data -- is widely regarded as the gold standard, believed to be robust against privacy-related attacks. In this paper, we challenge this assumption by introducing a novel data extraction attack that compromises even exact unlearning. Our method leverages both the pre- and post-unlearning models: by guiding the post-unlearning model using signals from the pre-unlearning model, we uncover patterns that reflect the removed data distribution. Combining model guidance with a token filtering strategy, our attack significantly improves extraction success rates -- doubling performance in some cases -- across common benchmarks such as MUSE, TOFU, and WMDP. Furthermore, we demonstrate our attack's effectiveness on a simulated medical diagnosis dataset to highlight real-world privacy risks associated with exact unlearning. In light of our findings, which suggest that unlearning may, in a contradictory way, increase the risk of privacy leakage, we advocate for evaluation of unlearning methods to consider broader threat models that account not only for post-unlearning models but also for adversarial access to prior checkpoints.

摘要

大型语言模型通常基于从网络收集的数据集进行训练,这些数据集可能无意中包含有害或敏感的个人信息。针对日益增长的隐私担忧,研究者提出了遗忘方法以消除特定数据对已训练模型的影响。其中,精确遗忘——通过排除目标数据重新训练模型——被广泛视为黄金标准,被认为能有效抵御隐私相关攻击。本文挑战了这一假设,提出了一种新型数据提取攻击方法,该方法甚至能够攻破精确遗忘的防御机制。我们的技术同时利用遗忘前与遗忘后的模型:通过使用遗忘前模型的信号引导遗忘后模型,我们发现了反映被删除数据分布的特征模式。结合模型引导与标记过滤策略,本攻击在MUSE、TOFU和WMDP等基准测试中显著提升了数据提取成功率(某些情况下性能翻倍)。此外,我们在模拟医疗诊断数据集上验证了攻击的有效性,揭示了精确遗忘在实际应用中可能带来的隐私风险。研究结果表明,遗忘过程可能以矛盾的方式增加隐私泄露风险,因此我们主张对遗忘方法的评估应当采用更全面的威胁模型,不仅要考察遗忘后模型的安全性,还需考虑攻击者获取历史检查点的情况。


LPASS: Linear Probes as Stepping Stones for vulnerability detection using compressed LLMs

Abstract

arXiv:2505.24451v1 Announce Type: cross Abstract: Large Language Models (LLMs) are being extensively used for cybersecurity purposes. One of them is the detection of vulnerable code. For the sake of efficiency and effectiveness, compression and fine-tuning techniques are being developed, respectively. However, they involve substantial computational effort. In this vein, we analyse how Linear Probes (LPs) can be used to provide an estimation on the performance of a compressed LLM at an early phase -- before fine-tuning. We also show their suitability to set the cut-off point when applying layer pruning compression. Our approach, dubbed LPASS, is applied to BERT and Gemma for the detection of 12 of MITRE's Top 25 most dangerous vulnerabilities on 480k C/C++ samples. LPs can be computed in 142.97 s and provide key findings: (1) 33.3% and 72.2% of layers can be removed, respectively, with no precision loss; (2) they provide an early estimate of the post-fine-tuning and post-compression model effectiveness, with 3% and 8.68% as the lowest and average precision errors, respectively. LPASS-based LLMs outperform the state of the art, reaching 86.9% accuracy in multi-class vulnerability detection. Interestingly, LPASS-based compressed versions of Gemma outperform the original ones by up to 1.6% in F1-score while saving 29.4% and 23.8% of training and inference time and 42.98% of model size.
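
Using probe results to set a layer-pruning cut-off can be sketched as follows, assuming one linear probe has already been trained per layer and yields a validation score. The selection rule (keep layers up to the earliest one whose probe score is within a tolerance of the best) and the function names are assumptions for illustration; the abstract does not state the exact criterion LPASS uses.

```python
def choose_cutoff(probe_scores, tolerance=0.0):
    """Return the index of the earliest layer whose probe score is within
    `tolerance` of the best score across all layers (hypothetical rule)."""
    best = max(probe_scores)
    for layer, score in enumerate(probe_scores):
        if score >= best - tolerance:
            return layer
    return len(probe_scores) - 1  # unreachable for non-empty input; kept for safety


def removable_fraction(probe_scores, tolerance=0.0):
    """Fraction of layers that lie after the chosen cut-off and could be pruned."""
    kept = choose_cutoff(probe_scores, tolerance) + 1
    return (len(probe_scores) - kept) / len(probe_scores)
```

A larger tolerance trades a little estimated precision for a smaller model, and the decision needs only cheap probes on frozen activations, before any fine-tuning is spent on the compressed model.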


LLMs Are Globally Multilingual Yet Locally Monolingual: Exploring Knowledge Transfer via Language and Thought Theory

Abstract

arXiv:2505.24409v1 Announce Type: cross Abstract: Multilingual large language models (LLMs) open up new possibilities for leveraging information across languages, but their factual knowledge recall remains inconsistent depending on the input language. While previous studies have attempted to address this issue through English-based prompting and evaluation, we explore non-English to English transfer via Language and Thought Theory. This perspective allows us to examine language-thought binding in LLMs and uncover why factual knowledge often fails to transfer effectively. We propose the Language-to-Thought (L2T) prompting strategy, which analyzes the relationship between input language, internal cognitive processes, and knowledge. Experimental results challenge the assumption that English-based approaches consistently outperform other languages and offer a novel insight that aligning the model's internal thought with the knowledge required for the task is critical for successful cross-lingual transfer. Furthermore, we show that applying L2T during training can alleviate LLMs' reliance on the input language and facilitate cross-linguistic knowledge integration without translation-based learning. Code and datasets will be available.

摘要

多语言大语言模型(LLMs)为跨语言信息利用开辟了新途径,但其事实知识回忆能力仍因输入语言不同而存在差异。尽管先前研究尝试通过基于英语的提示和评估来解决这一问题,我们借助语言与思维理论探索了非英语到英语的知识迁移。这一视角使我们能够考察LLMs中语言与思维的绑定机制,并揭示事实知识为何经常无法有效迁移的原因。我们提出"语言到思维"(L2T)提示策略,通过分析输入语言、内部认知过程与知识之间的关系。实验结果挑战了"基于英语的方法始终优于其他语言"的假设,并提出创新性见解:将模型内部思维与任务所需知识对齐是实现成功跨语言迁移的关键。此外,我们证明在训练阶段应用L2T策略可以减轻LLMs对输入语言的依赖,促进无需翻译学习的跨语言知识整合。代码与数据集将公开提供。


Learning Safety Constraints for Large Language Models

Abstract

arXiv:2505.24445v1 Announce Type: cross Abstract: Large language models (LLMs) have emerged as powerful tools but pose significant safety risks through harmful outputs and vulnerability to adversarial attacks. We propose SaP, short for Safety Polytope, a geometric approach to LLM safety that learns and enforces multiple safety constraints directly in the model's representation space. We develop a framework that identifies safe and unsafe regions via the polytope's facets, enabling both detection and correction of unsafe outputs through geometric steering. Unlike existing approaches that modify model weights, SaP operates post-hoc in the representation space, preserving model capabilities while enforcing safety constraints. Experiments across multiple LLMs demonstrate that our method can effectively detect unethical inputs and reduce adversarial attack success rates while maintaining performance on standard tasks, highlighting the importance of having an explicit geometric model for safety. Analysis of the learned polytope facets reveals the emergence of specialization in detecting different semantic notions of safety, providing interpretable insights into how safety is captured in LLMs' representation space.

摘要

大型语言模型(LLMs)已成为强大工具,但其有害输出及对抗攻击脆弱性也带来重大安全风险。本文提出安全多面体(Safety Polytope,简称SaP)——一种基于几何学的LLM安全方法,通过在模型表征空间中直接学习并强制执行多重安全约束。我们开发了一个框架,通过多面体刻面识别安全与不安全区域,实现基于几何导向的不安全输出检测与校正。与现有修改模型权重的方法不同,SaP在表征空间进行事后操作,在保持模型能力的同时实施安全约束。跨多个LLM的实验表明,该方法能有效检测不道德输入,在保持标准任务性能的同时降低对抗攻击成功率,从而凸显显式几何安全模型的重要性。对习得多面体刻面的分析揭示了其在检测不同安全语义概念时的专业化涌现,为理解LLM表征空间如何捕捉安全性提供了可解释的洞见。
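The geometric picture in the SaP abstract (a safe region bounded by facets, with detection and steering) can be sketched in a few lines. This is an illustrative sketch only: the facets here are hand-set rather than learned, and the cyclic-projection steering rule is an assumption, not the method described in the paper.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def violated_facets(h, facets, eps=1e-9):
    """Indices of facets (w, b) with w . h > b, i.e. constraints h breaks."""
    return [i for i, (w, b) in enumerate(facets) if dot(w, h) > b + eps]


def steer_into_polytope(h, facets, max_iters=100):
    """Project h onto one violated half-space at a time until none remain
    (or max_iters is reached). Hypothetical steering rule."""
    h = list(h)
    for _ in range(max_iters):
        bad = violated_facets(h, facets)
        if not bad:
            break
        w, b = facets[bad[0]]
        step = (dot(w, h) - b) / dot(w, w)
        h = [x - step * wi for x, wi in zip(h, w)]
    return h


# Toy 2-D "representation space": safe region is x <= 1 and y <= 1.
toy_facets = [([1.0, 0.0], 1.0), ([0.0, 1.0], 1.0)]
unsafe_point = [2.0, 3.0]
flagged = violated_facets(unsafe_point, toy_facets)
corrected = steer_into_polytope(unsafe_point, toy_facets)
```

Detection is the check that any facet is violated; correction moves the representation back to the boundary of the safe region without touching model weights.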


Adversarial Preference Learning for Robust LLM Alignment

Abstract

arXiv:2505.24369v1 Announce Type: cross Abstract: Modern language models often rely on Reinforcement Learning from Human Feedback (RLHF) to encourage safe behaviors. However, they remain vulnerable to adversarial attacks due to three key limitations: (1) the inefficiency and high cost of human annotation, (2) the vast diversity of potential adversarial attacks, and (3) the risk of feedback bias and reward hacking. To address these challenges, we introduce Adversarial Preference Learning (APL), an iterative adversarial training method incorporating three key innovations. First, a direct harmfulness metric based on the model's intrinsic preference probabilities, eliminating reliance on external assessment. Second, a conditional generative attacker that synthesizes input-specific adversarial variations. Third, an iterative framework with automated closed-loop feedback, enabling continuous adaptation through vulnerability discovery and mitigation. Experiments on Mistral-7B-Instruct-v0.3 demonstrate that APL significantly enhances robustness, achieving 83.33% harmlessness win rate over the base model (evaluated by GPT-4o), reducing harmful outputs from 5.88% to 0.43% (measured by LLaMA-Guard), and lowering attack success rate by up to 65% according to HarmBench. Notably, APL maintains competitive utility, with an MT-Bench score of 6.59 (comparable to the baseline 6.78) and an LC-WinRate of 46.52% against the base model.

摘要

现代语言模型常依赖人类反馈强化学习(RLHF)来促进安全行为,但仍存在三个关键缺陷使其易受对抗攻击:(1)人工标注效率低下且成本高昂;(2)潜在对抗攻击类型高度多样化;(3)反馈偏差与奖励破解风险。为解决这些问题,我们提出对抗偏好学习(APL),这是一种融合三项关键创新的迭代对抗训练方法:首先,基于模型内在偏好概率构建直接危害性度量,消除对外部评估的依赖;其次,开发条件生成式攻击器以合成输入特定的对抗变体;最后,建立带自动闭环反馈的迭代框架,通过漏洞发现与修复实现持续适应。在Mistral-7B-Instruct-v0.3上的实验表明,APL显著提升鲁棒性:相较于基线模型实现83.33%无害胜率(GPT-4o评估),有害输出从5.88%降至0.43%(LLaMA-Guard测量),根据HarmBench攻击成功率最高降低65%。值得注意的是,APL保持了竞争力效用指标:MT-Bench得分6.59(基线6.78),LC-WinRate相对基线达46.52%。


Towards Effective Code-Integrated Reasoning

Abstract

arXiv:2505.24480v1 Announce Type: cross Abstract: In this paper, we investigate code-integrated reasoning, where models generate code when necessary and integrate feedback by executing it through a code interpreter. To acquire this capability, models must learn when and how to use external code tools effectively, which is supported by tool-augmented reinforcement learning (RL) through interactive learning. Despite its benefits, tool-augmented RL can still suffer from potential instability in the learning dynamics. In light of this challenge, we present a systematic approach to improving the training effectiveness and stability of tool-augmented RL for code-integrated reasoning. Specifically, we develop enhanced training strategies that balance exploration and stability, progressively building tool-use capabilities while improving reasoning performance. Through extensive experiments on five mainstream mathematical reasoning benchmarks, our model demonstrates significant performance improvements over multiple competitive baselines. Furthermore, we conduct an in-depth analysis of the mechanism and effect of code-integrated reasoning, revealing several key insights, such as the extension of the model's capability boundaries and the simultaneous improvement of reasoning efficiency through code integration. All data and code for reproducing this work are available at: https://github.com/RUCAIBox/CIR.

摘要

本文研究了代码集成推理方法,即模型在必要时生成代码并通过代码解释器执行以获得反馈。为掌握这种能力,模型需要学习何时及如何有效使用外部代码工具,这一过程通过工具增强的强化学习(RL)交互训练实现。尽管具有优势,工具增强RL仍可能面临学习动态不稳定的问题。针对这一挑战,我们提出系统性方法以提升代码集成推理中工具增强RL的训练效果与稳定性。具体而言,我们开发了增强训练策略,平衡探索与稳定性,在逐步构建工具使用能力的同时提升推理性能。通过在五大主流数学推理基准上的大量实验,我们的模型相比多个竞争基线展现出显著性能提升。此外,我们对代码集成推理的机制与效果进行了深入分析,揭示了若干关键发现,例如模型能力边界的扩展以及通过代码集成实现推理效率的同步提升。本工作的复现数据与代码详见:https://github.com/RUCAIBox/CIR。


Train One Sparse Autoencoder Across Multiple Sparsity Budgets to Preserve Interpretability and Accuracy

Abstract

arXiv:2505.24473v1 Announce Type: cross Abstract: Sparse Autoencoders (SAEs) have proven to be powerful tools for interpreting neural networks by decomposing hidden representations into disentangled, interpretable features via sparsity constraints. However, conventional SAEs are constrained by the fixed sparsity level chosen during training; meeting different sparsity requirements therefore demands separate models and increases the computational footprint during both training and evaluation. We introduce a novel training objective, HierarchicalTopK, which trains a single SAE to optimise reconstructions across multiple sparsity levels simultaneously. Experiments with Gemma-2 2B demonstrate that our approach achieves Pareto-optimal trade-offs between sparsity and explained variance, outperforming traditional SAEs trained at individual sparsity levels. Further analysis shows that HierarchicalTopK preserves high interpretability scores even at higher sparsity. The proposed objective thus closes an important gap between flexibility and interpretability in SAE design.

摘要

稀疏自编码器(SAEs)通过稀疏性约束将神经网络的隐藏表征分解为可解耦、可解释的特征,已被证明是解释神经网络的有效工具。然而,传统SAEs受限于训练时选定的固定稀疏度水平,为满足不同稀疏度需求必须训练多个独立模型,这显著增加了训练和评估阶段的计算开销。我们提出了一种新型训练目标——分层TopK(HierarchicalTopK),该目标能够训练单一SAE模型同时优化多个稀疏度水平下的重构性能。基于Gemma-2 2B模型的实验表明,我们的方法在稀疏度与解释方差之间实现了帕累托最优权衡,其性能优于针对单一稀疏度训练的传统SAEs。进一步分析显示,分层TopK即使在较高稀疏度下仍能保持优异的可解释性评分。因此,该目标有效弥补了SAE设计中灵活性与可解释性之间的重要鸿沟。
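One way to read "optimise reconstructions across multiple sparsity levels simultaneously" is as an average of top-k reconstruction losses over several budgets k. The sketch below assumes exactly that; the averaging rule, the function names, and the toy decoder are assumptions, since the abstract does not give the precise form of the HierarchicalTopK objective.

```python
def topk_reconstruction(z, decoder, k):
    """Reconstruct the input from the k largest latent activations.
    decoder[j] is the output-space vector for latent feature j."""
    top = sorted(range(len(z)), key=lambda j: z[j], reverse=True)[:k]
    dim = len(decoder[0])
    x_hat = [0.0] * dim
    for j in top:
        for d in range(dim):
            x_hat[d] += z[j] * decoder[j][d]
    return x_hat


def mse(x, x_hat):
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def multi_budget_loss(x, z, decoder, budgets):
    """Mean reconstruction error over all sparsity budgets (assumed form)."""
    losses = [mse(x, topk_reconstruction(z, decoder, k)) for k in budgets]
    return sum(losses) / len(losses)


# Toy example: identity decoder, so latents equal the input coordinates.
toy_decoder = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_x = [3.0, 2.0, 1.0]
toy_z = [3.0, 2.0, 1.0]
toy_loss = multi_budget_loss(toy_x, toy_z, toy_decoder, budgets=[1, 2, 3])
```

Because a single set of latents is scored at every budget, one trained model can be evaluated at any of those sparsity levels instead of training a separate SAE per level.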


Evaluating Gemini in an arena for learning

Abstract

arXiv:2505.24477v1 Announce Type: cross Abstract: Artificial intelligence (AI) is poised to transform education, but the research community lacks a robust, general benchmark to evaluate AI models for learning. To assess state-of-the-art support for educational use cases, we ran an "arena for learning" where educators and pedagogy experts conduct blind, head-to-head, multi-turn comparisons of leading AI models. In particular, N = 189 educators drew from their experience to role-play realistic learning use cases, interacting with two models sequentially, after which N = 206 experts judged which model better supported the user's learning goals. The arena evaluated a slate of state-of-the-art models: Gemini 2.5 Pro, Claude 3.7 Sonnet, GPT-4o, and OpenAI o3. Excluding ties, experts preferred Gemini 2.5 Pro in 73.2% of these match-ups -- ranking it first overall in the arena. Gemini 2.5 Pro also demonstrated markedly higher performance across key principles of good pedagogy. Altogether, these results position Gemini 2.5 Pro as a leading model for learning.

摘要

人工智能(AI)即将变革教育领域,但研究界目前缺乏一个强大且通用的基准来评估适用于学习场景的AI模型。为评估当前最先进模型在教育用例中的支持能力,我们建立了一个"学习竞技场"——由教育工作者和教学专家对主流AI模型进行盲法、多轮次、两两对比评估。具体而言,189名教育工作者基于实际经验模拟真实学习场景,依次与两个模型互动;随后206名专家评判哪个模型更有效地支持用户学习目标。该竞技场评估了包括Gemini 2.5 Pro、Claude 3.7 Sonnet、GPT-4o和OpenAI o3在内的一系列前沿模型。在排除平局情况后,专家在73.2%的对决中更倾向于选择Gemini 2.5 Pro,使其在总体排名中位列第一。Gemini 2.5 Pro在优质教学法的关键原则方面也展现出显著更高的性能表现。综合来看,这些结果表明Gemini 2.5 Pro堪称当前最适用于学习场景的领先模型。
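The phrase "excluding ties" matters when reading a head-to-head preference rate. The short sketch below shows the arithmetic; the function name and the counts are made up for illustration and are not taken from the paper.

```python
def preference_rate(wins, losses, ties, exclude_ties=True):
    """Share of match-ups won, with ties optionally dropped from the total."""
    total = wins + losses if exclude_ties else wins + losses + ties
    if total == 0:
        raise ValueError("no match-ups to score")
    return wins / total


# Made-up counts for illustration only (not from the paper).
rate_without_ties = preference_rate(wins=30, losses=10, ties=60)
rate_with_ties = preference_rate(wins=30, losses=10, ties=60, exclude_ties=False)
```

With these hypothetical counts the same judgments give 0.75 when ties are excluded and 0.3 when they are counted, so the two conventions should not be compared directly.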


TimeHC-RL: Temporal-aware Hierarchical Cognitive Reinforcement Learning for Enhancing LLMs' Social Intelligence

Abstract

arXiv:2505.24500v1 Announce Type: cross Abstract: Recently, Large Language Models (LLMs) have made significant progress in IQ-related domains that require careful thinking, such as mathematics and coding. However, enhancing LLMs' cognitive development in social domains, particularly from a post-training perspective, remains underexplored. Recognizing that the social world follows a distinct timeline and requires a richer blend of cognitive modes (from intuitive reactions (System 1) and surface-level thinking to deliberate thinking (System 2)) than mathematics, which primarily relies on System 2 cognition (careful, step-by-step reasoning), we introduce Temporal-aware Hierarchical Cognitive Reinforcement Learning (TimeHC-RL) for enhancing LLMs' social intelligence. In our experiments, we systematically explore improving LLMs' social intelligence and validate the effectiveness of the TimeHC-RL method, through five other post-training paradigms and two test-time intervention paradigms on eight datasets with diverse data patterns. Experimental results reveal the superiority of our proposed TimeHC-RL method compared to the widely adopted System 2 RL method. It gives the 7B backbone model wings, enabling it to rival the performance of advanced models like DeepSeek-R1 and OpenAI-O3. Additionally, the systematic exploration from post-training and test-time interventions perspectives to improve LLMs' social intelligence has uncovered several valuable insights.

摘要

近期,大语言模型(LLMs)在需要缜密思考的智商相关领域(如数学与编程)取得了显著进展。然而,从训练后优化的视角提升LLMs在社会认知领域的发展仍待深入探索。鉴于社会情境遵循独特的时间线,且需要比数学领域(主要依赖系统2认知——审慎的逐步推理)更丰富的认知模式组合(从直觉反应(系统1)与浅层思考到深思熟虑(系统2)),我们提出了时序感知分层认知强化学习(TimeHC-RL)来增强LLMs的社会智能。实验中,我们通过五种其他训练后范式与两种测试时干预范式,在八种数据模式各异的数据集上系统探索了LLMs社会智能的提升,并验证了TimeHC-RL方法的有效性。实验结果表明,相较于广泛采用的系统2强化学习方法,我们提出的TimeHC-RL方法具有显著优势。该方法为70亿参数的基础模型赋能,使其性能可比肩DeepSeek-R1与OpenAI-O3等先进模型。此外,从训练后优化与测试时干预双重视角系统提升LLMs社会智能的研究,还揭示了若干重要发现。


Beyond Linear Steering: Unified Multi-Attribute Control for Language Models

Abstract

arXiv:2505.24535v1 Announce Type: cross Abstract: Controlling multiple behavioral attributes in large language models (LLMs) at inference time is a challenging problem due to interference between attributes and the limitations of linear steering methods, which assume additive behavior in activation space and require per-attribute tuning. We introduce K-Steering, a unified and flexible approach that trains a single non-linear multi-label classifier on hidden activations and computes intervention directions via gradients at inference time. This avoids linearity assumptions, removes the need for storing and tuning separate attribute vectors, and allows dynamic composition of behaviors without retraining. To evaluate our method, we propose two new benchmarks, ToneBank and DebateMix, targeting compositional behavioral control. Empirical results across 3 model families, validated by both activation-based classifiers and LLM-based judges, demonstrate that K-Steering outperforms strong baselines in accurately steering multiple behaviors.

摘要

控制大型语言模型(LLM)在推理时多个行为属性是一个具有挑战性的问题,这源于属性间的相互干扰以及线性引导方法的局限性——后者假设激活空间具有可加性且需针对每个属性单独调参。我们提出K-Steering方法,通过训练一个基于隐藏激活的非线性多标签分类器,并在推理时通过梯度计算干预方向,形成统一灵活的解决方案。该方法无需线性假设,避免了存储和调整个别属性向量的需求,同时支持行为动态组合而无需重新训练。为评估该方法,我们构建了ToneBank和DebateMix两个新基准测试,专注于组合行为控制研究。在三个模型系列上的实验结果表明,经基于激活的分类器和LLM评判者双重验证,K-Steering在精确引导多重行为方面优于现有基线方法。
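The core mechanism in the K-Steering abstract, computing an intervention direction as the gradient of a non-linear multi-label classifier with respect to hidden activations, can be sketched with a tiny hand-differentiated network. This is a minimal sketch under stated assumptions: the one-hidden-layer tanh classifier, the "sum of target logits" score, the step size, and all weights are hypothetical and not taken from the paper.

```python
import math


def forward(h, W1, b1, W2, b2):
    """One-hidden-layer tanh classifier; returns hidden units and logits."""
    a = [math.tanh(sum(w * x for w, x in zip(row, h)) + b)
         for row, b in zip(W1, b1)]
    logits = [sum(w * x for w, x in zip(row, a)) + b
              for row, b in zip(W2, b2)]
    return a, logits


def target_score(h, W1, b1, W2, b2, targets):
    """Sum of the logits of the attributes we want to strengthen."""
    _, logits = forward(h, W1, b1, W2, b2)
    return sum(logits[t] for t in targets)


def steering_direction(h, W1, b1, W2, b2, targets):
    """Gradient of target_score with respect to the activation h."""
    a, _ = forward(h, W1, b1, W2, b2)
    grad_a = [sum(W2[t][j] for t in targets) for j in range(len(a))]
    grad_pre = [g * (1 - aj * aj) for g, aj in zip(grad_a, a)]
    return [sum(W1[j][i] * grad_pre[j] for j in range(len(a)))
            for i in range(len(h))]


def apply_steering(h, direction, alpha=0.1):
    return [x + alpha * g for x, g in zip(h, direction)]


# Hypothetical weights: 2-D activation, 2 hidden units, 3 attributes.
W1 = [[0.5, -0.3], [0.2, 0.8]]
b1 = [0.0, 0.1]
W2 = [[1.0, -1.0], [0.5, 0.5], [-0.7, 0.2]]
b2 = [0.0, 0.0, 0.0]
h0 = [0.3, -0.2]
wanted = [0, 1]  # steer toward attributes 0 and 1 at the same time
direction = steering_direction(h0, W1, b1, W2, b2, wanted)
h1 = apply_steering(h0, direction)
```

Changing `wanted` composes a different set of attributes with the same classifier, which is the sense in which no separate per-attribute vector has to be stored or tuned.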


Can Slow-thinking LLMs Reason Over Time? Empirical Studies in Time Series Forecasting

Abstract

arXiv:2505.24511v1 Announce Type: cross Abstract: Time series forecasting (TSF) is a fundamental and widely studied task, spanning methods from classical statistical approaches to modern deep learning and multimodal language modeling. Despite their effectiveness, these methods often follow a fast thinking paradigm emphasizing pattern extraction and direct value mapping, while overlooking explicit reasoning over temporal dynamics and contextual dependencies. Meanwhile, emerging slow-thinking LLMs (e.g., ChatGPT-o1, DeepSeek-R1) have demonstrated impressive multi-step reasoning capabilities across diverse domains, suggesting a new opportunity for reframing TSF as a structured reasoning task. This motivates a key question: can slow-thinking LLMs effectively reason over temporal patterns to support time series forecasting, even in a zero-shot manner? To investigate this, in this paper, we propose TimeReasoner, an extensive empirical study that formulates TSF as a conditional reasoning task. We design a series of prompting strategies to elicit inference-time reasoning from pretrained slow-thinking LLMs and evaluate their performance across diverse TSF benchmarks. Our findings reveal that slow-thinking LLMs exhibit non-trivial zero-shot forecasting capabilities, especially in capturing high-level trends and contextual shifts. While preliminary, our study surfaces important insights into the reasoning behaviors of LLMs in temporal domains, highlighting both their potential and limitations. We hope this work catalyzes further research into reasoning-based forecasting paradigms and paves the way toward more interpretable and generalizable TSF frameworks.

摘要

时间序列预测(TSF)是一项基础且被广泛研究的任务,其方法涵盖从经典统计方法到现代深度学习和多模态语言建模。尽管这些方法有效,但它们通常遵循强调模式提取和直接值映射的"快速思考"范式,而忽视了对时间动态和上下文依赖关系的显式推理。与此同时,新兴的"慢思考"大语言模型(如ChatGPT-o1、DeepSeek-R1)在多个领域展现出令人印象深刻的多步推理能力,这为将TSF重新构建为结构化推理任务提供了新机遇。由此引出一个关键问题:慢思考大语言模型能否有效推理时间模式以支持时间序列预测,甚至以零样本方式实现?为探究此问题,本文提出TimeReasoner——一项将TSF构建为条件推理任务的系统性实证研究。我们设计了一系列提示策略,从预训练的慢思考大语言模型中激发推理时推理能力,并在多样化TSF基准上评估其性能。研究发现表明,慢思考大语言模型展现出显著的零样本预测能力,尤其在捕捉高层趋势和上下文变化方面。虽然尚属初步,但本研究揭示了LLM在时间领域推理行为的重要洞见,既凸显了其潜力也指明了局限性。我们期望这项工作能推动基于推理的预测范式研究,并为构建更具可解释性和泛化能力的TSF框架铺平道路。


CREFT: Sequential Multi-Agent LLM for Character Relation Extraction

Abstract

arXiv:2505.24553v1 Announce Type: cross Abstract: Understanding complex character relations is crucial for narrative analysis and efficient script evaluation, yet existing extraction methods often fail to handle long-form narratives with nuanced interactions. To address this challenge, we present CREFT, a novel sequential framework leveraging specialized Large Language Model (LLM) agents. First, CREFT builds a base character graph through knowledge distillation, then iteratively refines character composition, relation extraction, role identification, and group assignments. Experiments on a curated Korean drama dataset demonstrate that CREFT significantly outperforms single-agent LLM baselines in both accuracy and completeness. By systematically visualizing character networks, CREFT streamlines narrative comprehension and accelerates script review -- offering substantial benefits to the entertainment, publishing, and educational sectors.

摘要

理解复杂人物关系对于叙事分析和高效剧本评估至关重要,但现有提取方法往往难以处理具有微妙互动的长篇幅叙事。为解决这一挑战,我们提出CREFT——一种基于专用大语言模型(LLM)智能体的新型序列框架。该框架首先通过知识蒸馏构建基础人物图谱,随后迭代优化人物构成、关系提取、角色识别及群体划分。在精选韩剧数据集上的实验表明,CREFT在准确性和完整性方面显著优于单智能体LLM基线方法。通过系统化的人物关系网络可视化,CREFT能有效提升叙事理解效率并加速剧本审阅流程,为娱乐、出版及教育领域带来显著价值。


Stress-testing Machine Generated Text Detection: Shifting Language Models Writing Style to Fool Detectors

Abstract

arXiv:2505.24523v1 Announce Type: cross Abstract: Recent advancements in Generative AI and Large Language Models (LLMs) have enabled the creation of highly realistic synthetic content, raising concerns about the potential for malicious use, such as misinformation and manipulation. Moreover, detecting Machine-Generated Text (MGT) remains challenging due to the lack of robust benchmarks that assess generalization to real-world scenarios. In this work, we present a pipeline to test the resilience of state-of-the-art MGT detectors (e.g., Mage, Radar, LLM-DetectAIve) to linguistically informed adversarial attacks. To challenge the detectors, we fine-tune language models using Direct Preference Optimization (DPO) to shift the MGT style toward human-written text (HWT). This exploits the detectors' reliance on stylistic clues, making new generations more challenging to detect. Additionally, we analyze the linguistic shifts induced by the alignment and which features are used by detectors to detect MGT texts. Our results show that detectors can be easily fooled with relatively few examples, resulting in a significant drop in detection performance. This highlights the importance of improving detection methods and making them robust to unseen in-domain texts.

摘要

生成式人工智能与大语言模型(LLMs)的最新进展使得高度逼真的合成内容生成成为可能,这引发了人们对恶意使用(如虚假信息与操纵)的担忧。然而,由于缺乏评估现实场景泛化能力的可靠基准,机器生成文本(MGT)的检测仍具挑战性。本研究提出一种流程,用于测试最先进MGT检测器(如Mage、Radar、LLM-DetectAIve)对语言学对抗攻击的鲁棒性。为挑战检测器,我们采用直接偏好优化(DPO)对语言模型进行微调,将MGT风格向人类书写文本(HWT)偏移。该方法利用检测器对风格线索的依赖,使新生成文本更难被识别。此外,我们分析了对齐过程引发的语言学特征变化,以及检测器用于识别MGT文本的特征指标。实验结果表明,仅需少量样本即可显著降低检测器性能,凸显了改进检测方法并增强其对未见域内文本鲁棒性的重要性。


Localizing Persona Representations in LLMs

Abstract

arXiv:2505.24539v1 Announce Type: cross Abstract: We present a study on how and where personas -- defined by distinct sets of human characteristics, values, and beliefs -- are encoded in the representation space of large language models (LLMs). Using a range of dimension reduction and pattern recognition methods, we first identify the model layers that show the greatest divergence in encoding these representations. We then analyze the activations within a selected layer to examine how specific personas are encoded relative to others, including their shared and distinct embedding spaces. We find that, across multiple pre-trained decoder-only LLMs, the analyzed personas show large differences in representation space only within the final third of the decoder layers. We observe overlapping activations for specific ethical perspectives -- such as moral nihilism and utilitarianism -- suggesting a degree of polysemy. In contrast, political ideologies like conservatism and liberalism appear to be represented in more distinct regions. These findings help to improve our understanding of how LLMs internally represent information and can inform future efforts in refining the modulation of specific human traits in LLM outputs. Warning: This paper includes potentially offensive sample statements.

摘要

我们针对人格特征(由不同人类特质、价值观和信仰所定义)在大型语言模型(LLMs)表征空间中的编码方式和位置展开研究。通过采用多种降维和模式识别方法,我们首先识别出在编码这些表征时差异最显著的模型层级。随后我们选取特定层级的激活值进行分析,以探究不同人格特征(包括其共享和独有嵌入空间)的相对编码方式。研究发现:在多个预训练的仅解码器LLM中,被分析的人格特征仅在解码器最后三分之一的层级中表现出显著的表征空间差异。我们观察到特定伦理观点(如道德虚无主义与功利主义)存在激活重叠现象,表明存在一定程度的语义多义性。相比之下,保守主义与自由主义等政治意识形态则呈现出更显著的区域分离特征。这些发现有助于深化我们对LLMs内部信息表征机制的理解,并为未来优化LLM输出中特定人类特质的调控提供参考依据。注:本文包含可能具有冒犯性的示例陈述。


Cross-Attention Speculative Decoding

Abstract

arXiv:2505.24544v1 Announce Type: cross Abstract: Speculative decoding (SD) is a widely adopted approach for accelerating inference in large language models (LLMs), particularly when the draft and target models are well aligned. However, state-of-the-art SD methods typically rely on tightly coupled, self-attention-based Transformer decoders, often augmented with auxiliary pooling or fusion layers. This coupling makes them increasingly complex and harder to generalize across different models. We present Budget EAGLE (Beagle), the first, to our knowledge, cross-attention-based Transformer decoder SD model that achieves performance on par with leading self-attention SD models (EAGLE-v2) while eliminating the need for pooling or auxiliary components, simplifying the architecture, improving training efficiency, and maintaining stable memory usage during training-time simulation. To enable effective training of this novel architecture, we propose Two-Stage Block-Attention Training, a new method that achieves training stability and convergence efficiency in block-level attention scenarios. Extensive experiments across multiple LLMs and datasets show that Beagle achieves competitive inference speedups and higher training efficiency than EAGLE-v2, offering a strong alternative for architectures in speculative decoding.

摘要

推测解码(SD)是一种广泛应用于加速大语言模型(LLM)推理的方法,尤其在草案模型与目标模型高度对齐时效果显著。然而,当前最先进的SD方法通常依赖于紧密耦合的、基于自注意力机制的Transformer解码器,并常需辅以额外的池化或融合层。这种耦合导致其架构日益复杂,且难以在不同模型间泛化。我们提出了Budget EAGLE(Beagle),据我们所知,这是首个基于交叉注意力机制的Transformer解码器SD模型,其性能与领先的自注意力SD模型(EAGLE-v2)相当,同时无需池化或辅助组件,简化了架构,提升了训练效率,并在训练时模拟过程中保持稳定的内存使用。为有效训练这一新颖架构,我们提出两阶段块注意力训练法,该方法在块级注意力场景下实现了训练稳定性和收敛效率。跨多个LLM和数据集的广泛实验表明,Beagle在推理加速和训练效率上均优于EAGLE-v2,为推测解码架构提供了强有力的替代方案。
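For readers new to speculative decoding, the sketch below shows the standard verification step that makes draft/target alignment matter: a drafted token is accepted with probability min(1, p/q), and on rejection a replacement is drawn from the normalized residual max(p - q, 0). This is the generic speculative sampling rule, not Beagle's cross-attention architecture; the function names and toy distributions are assumptions.

```python
import random


def sample_index(dist, rng):
    """Draw an index from a discrete distribution using rng.random()."""
    r = rng.random()
    cumulative = 0.0
    for i, prob in enumerate(dist):
        cumulative += prob
        if r < cumulative:
            return i
    return len(dist) - 1


def residual_distribution(p, q):
    """Normalized max(p - q, 0), used to resample after a rejection."""
    raw = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    total = sum(raw)
    if total == 0.0:
        return list(p)  # p == q everywhere, any fallback is equivalent
    return [r / total for r in raw]


def verify_draft_token(token, p, q, rng):
    """p: target-model probabilities, q: draft-model probabilities.
    Returns (final_token, accepted)."""
    if q[token] <= 0.0:
        raise ValueError("draft model cannot propose a zero-probability token")
    if rng.random() < min(1.0, p[token] / q[token]):
        return token, True
    return sample_index(residual_distribution(p, q), rng), False


# Toy two-token vocabulary where the draft over-predicts token 0.
target_probs = [0.5, 0.5]
draft_probs = [0.9, 0.1]
final_token, accepted = verify_draft_token(0, target_probs, draft_probs, random.Random(0))
```

The better the draft distribution q matches the target distribution p, the more often the ratio is close to 1 and the more drafted tokens survive verification.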


Mixpert: Mitigating Multimodal Learning Conflicts with Efficient Mixture-of-Vision-Experts

Abstract

arXiv:2505.24541v1 Announce Type: cross Abstract: Multimodal large language models (MLLMs) require a nuanced interpretation of complex image information, typically leveraging a vision encoder to perceive various visual scenarios. However, relying solely on a single vision encoder to handle diverse task domains proves difficult and inevitably leads to conflicts. Recent work enhances data perception by directly integrating multiple domain-specific vision encoders, yet this structure adds complexity and limits the potential for joint optimization. In this paper, we introduce Mixpert, an efficient mixture-of-vision-experts architecture that inherits the joint learning advantages from a single vision encoder while being restructured into a multi-expert paradigm for task-specific fine-tuning across different visual tasks. Additionally, we design a dynamic routing mechanism that allocates input images to the most suitable visual expert. Mixpert effectively alleviates domain conflicts encountered by a single vision encoder in multi-task learning with minimal additional computational cost, making it more efficient than multiple encoders. Furthermore, Mixpert integrates seamlessly into any MLLM, with experimental results demonstrating substantial performance gains across various tasks.

摘要

多模态大语言模型(MLLMs)需要对复杂图像信息进行精细解读,通常依赖视觉编码器来感知多样化的视觉场景。然而,仅依靠单一视觉编码器处理多任务领域存在显著困难,并不可避免地导致领域冲突。近期研究通过直接集成多个领域专用视觉编码器来增强数据感知能力,但该结构增加了系统复杂性并限制了联合优化的可能性。本文提出Mixpert——一种高效的视觉专家混合架构,该架构既继承了单视觉编码器的联合学习优势,又能重构为多专家范式以适配不同视觉任务的针对性微调。此外,我们设计了动态路由机制,将输入图像分配至最合适的视觉专家。Mixpert以极低的计算开销有效缓解了单视觉编码器在多任务学习中的领域冲突问题,其效率显著优于多编码器方案。实验表明,该架构可无缝集成至任意MLLM,并在多种任务上实现显著的性能提升。


NexusSum: Hierarchical LLM Agents for Long-Form Narrative Summarization

Abstract

arXiv:2505.24575v1 Announce Type: cross Abstract: Summarizing long-form narratives--such as books, movies, and TV scripts--requires capturing intricate plotlines, character interactions, and thematic coherence, a task that remains challenging for existing LLMs. We introduce NexusSum, a multi-agent LLM framework for narrative summarization that processes long-form text through a structured, sequential pipeline--without requiring fine-tuning. Our approach introduces two key innovations: (1) Dialogue-to-Description Transformation: A narrative-specific preprocessing method that standardizes character dialogue and descriptive text into a unified format, improving coherence. (2) Hierarchical Multi-LLM Summarization: A structured summarization pipeline that optimizes chunk processing and controls output length for accurate, high-quality summaries. Our method establishes a new state-of-the-art in narrative summarization, achieving up to a 30.0% improvement in BERTScore (F1) across books, movies, and TV scripts. These results demonstrate the effectiveness of multi-agent LLMs in handling long-form content, offering a scalable approach for structured summarization in diverse storytelling domains.

摘要

长篇幅叙事文本(如书籍、电影和电视剧本)的摘要生成需要捕捉复杂的剧情线、角色互动和主题连贯性,这对现有大语言模型(LLMs)仍具挑战性。我们提出NexusSum——一种用于叙事摘要的多智能体LLM框架,通过结构化顺序流水线处理长文本,且无需微调。该方法包含两项关键创新:(1)对话-描述转换:一种叙事专用预处理技术,将角色对话与描述性文本统一标准化,提升连贯性;(2)分层多LLM摘要:结构化摘要流水线,优化文本块处理并控制输出长度,生成精确高质量摘要。我们的方法在叙事摘要任务中实现了最先进水平,在书籍、电影和电视剧本上的BERTScore(F1)最高提升30.0%。这些结果证明了多智能体LLMs处理长文本的有效性,为跨叙事领域的结构化摘要提供了可扩展方案。
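The chunk-then-condense loop behind hierarchical summarization with output-length control can be sketched as below. This is a rough sketch, not the NexusSum pipeline: `summarize` stands in for an LLM call, and the chunk size, word budget, and round limit are assumptions.

```python
def chunk_words(words, size):
    """Split a word list into consecutive chunks of at most `size` words."""
    return [words[i:i + size] for i in range(0, len(words), size)]


def hierarchical_summarize(text, summarize, chunk_size=200,
                           target_words=100, max_rounds=10):
    """Summarize chunks, join the results, and repeat until the text fits
    the word budget. `summarize` is any callable str -> str (placeholder
    for a model call)."""
    words = text.split()
    for _ in range(max_rounds):
        if len(words) <= target_words:
            break
        parts = [summarize(" ".join(chunk))
                 for chunk in chunk_words(words, chunk_size)]
        words = " ".join(parts).split()
    return " ".join(words)


def first_ten_words(passage):
    """Stub summarizer used only to make the sketch runnable."""
    return " ".join(passage.split()[:10])


long_text = " ".join(f"w{i}" for i in range(1000))
summary = hierarchical_summarize(long_text, first_ten_words)
```

The `max_rounds` bound keeps the loop finite even if the plugged-in summarizer fails to shorten its input.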


Bench4KE: Benchmarking Automated Competency Question Generation

Abstract

arXiv:2505.24554v1 Announce Type: cross Abstract: The availability of Large Language Models (LLMs) presents a unique opportunity to reinvigorate research on Knowledge Engineering (KE) automation, a trend already evident in recent efforts developing LLM-based methods and tools for the automatic generation of Competency Questions (CQs). However, the evaluation of these tools lacks standardisation. This undermines the methodological rigour and hinders the replication and comparison of results. To address this gap, we introduce Bench4KE, an extensible API-based benchmarking system for KE automation. Its first release focuses on evaluating tools that generate CQs automatically. CQs are natural language questions used by ontology engineers to define the functional requirements of an ontology. Bench4KE provides a curated gold standard consisting of CQ datasets from four real-world ontology projects. It uses a suite of similarity metrics to assess the quality of the CQs generated. We present a comparative analysis of four recent CQ generation systems, which are based on LLMs, establishing a baseline for future research. Bench4KE is also designed to accommodate additional KE automation tasks, such as SPARQL query generation, ontology testing and drafting. Code and datasets are publicly available under the Apache 2.0 license.

摘要

大型语言模型(LLM)的出现为知识工程(KE)自动化研究注入了新的活力,这一趋势在近期基于LLM的自动化能力问题(CQ)生成方法与工具开发中已得到体现。然而,现有评估工具缺乏标准化,不仅影响方法论的严谨性,也阻碍了研究结果的复现与比较。为此,我们提出Bench4KE——一个基于API的可扩展KE自动化基准测试系统。其首个版本专注于评估自动生成CQ的工具,CQ是本体工程师用于定义本体功能需求的自然语言问题。Bench4KE提供经过筛选的金标准数据集,包含来自四个真实本体项目的CQ集合,并采用相似性度量套件评估生成CQ的质量。我们通过对四个基于LLM的最新CQ生成系统进行对比分析,为未来研究建立基准。该系统还可扩展支持其他KE自动化任务,如SPARQL查询生成、本体测试与起草。代码及数据集均依据Apache 2.0许可证公开。
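Scoring generated competency questions against a gold standard with a similarity metric can be sketched as follows. The abstract does not list the metrics Bench4KE uses, so token-level Jaccard similarity is a stand-in here, and the function names and example questions are hypothetical.

```python
def tokens(question):
    """Lower-cased word set with the question mark stripped."""
    return set(question.lower().replace("?", "").split())


def jaccard(a, b):
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def score_generated(generated, gold):
    """Average, over generated CQs, of the best match against the gold set."""
    if not generated or not gold:
        raise ValueError("both question lists must be non-empty")
    best = [max(jaccard(g, ref) for ref in gold) for g in generated]
    return sum(best) / len(best)


# Hypothetical gold and generated competency questions.
gold_cqs = ["Which authors wrote a given paper?"]
generated_cqs = ["Which authors wrote a given paper?", "What is the weather?"]
benchmark_score = score_generated(generated_cqs, gold_cqs)
```

A lexical metric like this only rewards surface overlap; a fuller suite would typically add embedding-based similarity so that paraphrased questions are not penalized.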


Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors

Abstract

arXiv:2505.24625v1 Announce Type: cross Abstract: Previous research has investigated the application of Multimodal Large Language Models (MLLMs) in understanding 3D scenes by interpreting them as videos. These approaches generally depend on comprehensive 3D data inputs, such as point clouds or reconstructed Bird's-Eye View (BEV) maps. In our research, we advance this field by enhancing the capability of MLLMs to understand and reason in 3D spaces directly from video data, without the need for additional 3D input. We propose a novel and efficient method, the Video-3D Geometry Large Language Model (VG LLM). Our approach employs a 3D visual geometry encoder that extracts 3D prior information from video sequences. This information is integrated with visual tokens and fed into the MLLM. Extensive experiments have shown that our method has achieved substantial improvements in various tasks related to 3D scene understanding and spatial reasoning, all directly learned from video sources. Impressively, our 4B model, which does not rely on explicit 3D data inputs, achieves competitive results compared to existing state-of-the-art methods, and even surpasses the Gemini-1.5-Pro in the VSI-Bench evaluations.

摘要

先前研究探讨了多模态大语言模型(MLLMs)通过将三维场景解释为视频来实现场景理解的应用。这些方法通常依赖于完整的3D数据输入,如点云或重建的鸟瞰图(BEV)。在本研究中,我们通过增强MLLMs直接从视频数据理解和推理三维空间的能力推进了这一领域,无需额外3D输入。我们提出了一种新颖高效的方法——视频三维几何大语言模型(VG LLM)。该方法采用三维视觉几何编码器从视频序列中提取三维先验信息,并将其与视觉标记融合后输入MLLM。大量实验表明,我们的方法在直接从视频学习的各类三维场景理解和空间推理任务中均取得显著提升。值得注意的是,不依赖显式3D数据输入的40亿参数模型不仅与现有最先进方法取得相当成果,更在VSI-Bench评估中超越了Gemini-1.5-Pro模型。


BIMA: Bijective Maximum Likelihood Learning Approach to Hallucination Prediction and Mitigation in Large Vision-Language Models

Abstract

arXiv:2505.24649v1 Announce Type: cross Abstract: Large vision-language models have been widely adopted across various domains. However, developing a trustworthy system remains a significant challenge given the minimal interpretability of large-scale models. One of the most prevalent terms for the faulty behavior of these systems is hallucination, where the language model generates a response that does not correspond to the visual content. To mitigate this problem, several approaches have been developed, and one prominent direction is to improve the decoding process. In this paper, we propose a new Bijective Maximum Likelihood Learning (BIMA) approach to hallucination mitigation using normalizing flow theories. The proposed BIMA method can efficiently mitigate the hallucination problem in prevailing vision-language models, resulting in significant improvements. Notably, BIMA achieves an average F1 score of 85.06% on the POPE benchmark and reduces CHAIR_S and CHAIR_I by 7.6% and 2.6%, respectively. To the best of our knowledge, this is one of the first studies to consider bijection as a means of reducing hallucination induced by large vision-language models.

摘要

大型视觉语言模型已在多个领域得到广泛应用。然而,开发具有可解释特性且值得信赖的大规模模型系统仍面临重大挑战。这类系统最常见的功能谬误之一被称为"幻觉",即语言模型生成的响应与视觉内容不符。为缓解该问题,学界已提出多种方法,其中重要方向之一是改进解码过程。本文提出一种基于归一化流理论的双射最大似然学习(BIMA)方法来抑制幻觉现象。所提出的BIMA方法能有效缓解主流视觉语言模型中的幻觉问题,取得显著改进效果。值得注意的是,BIMA在POPE基准测试中平均F1分数达到85.06%,同时将CHAIRS和CHAIRI指标分别降低7.6%和2.6%。据我们所知,这是首批探讨利用双射方法减少大型视觉语言模型诱发幻觉的研究之一。
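As background for the "bijective maximum likelihood" framing, normalizing flows compute exact likelihoods through the change-of-variables formula. The symbols below (a bijection f, a base density p_Z, a data density p_X) are generic notation assumed for illustration; the abstract does not give BIMA's exact formulation.

```latex
\log p_X(x) = \log p_Z\bigl(f(x)\bigr) + \log\left|\det \frac{\partial f(x)}{\partial x}\right|
```

Because f is invertible, the likelihood is exact rather than a bound, which is what makes direct maximum likelihood training possible for flow-based models.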


Eye of Judgement: Dissecting the Evaluation of Russian-speaking LLMs with POLLUX

Abstract

arXiv:2505.24616v1 Announce Type: cross Abstract: We introduce POLLUX, a comprehensive open-source benchmark designed to evaluate the generative capabilities of large language models (LLMs) in Russian. Our main contribution is a novel evaluation methodology that enhances the interpretability of LLM assessment. For each task type, we define a set of detailed criteria and develop a scoring protocol where models evaluate responses and provide justifications for their ratings. This enables transparent, criteria-driven evaluation beyond traditional resource-consuming, side-by-side human comparisons. POLLUX includes a detailed, fine-grained taxonomy of 35 task types covering diverse generative domains such as code generation, creative writing, and practical assistant use cases, totaling 2,100 manually crafted and professionally authored prompts. Each task is categorized by difficulty (easy/medium/hard), with experts constructing the dataset entirely from scratch. We also release a family of LLM-as-a-Judge (7B and 32B) evaluators trained for nuanced assessment of generative outputs. This approach provides scalable, interpretable evaluation and annotation tools for model development, effectively replacing costly and less precise human judgments.

摘要

我们推出POLLUX——一个全面的开源基准测试,旨在评估大型语言模型(LLM)在俄语文本生成任务中的能力。本研究的主要贡献是提出了一种新颖的评估方法,该方法显著提升了LLM评估结果的可解释性。针对每类任务,我们定义了一套详细评估标准,并开发了模型自主评分协议:模型需对生成响应进行评分并给出评分依据。这种方法实现了超越传统人工对比评估的透明化、标准驱动的评估范式,同时避免了传统方法资源消耗大的缺点。POLLUX包含由35类任务构成的精细分类体系,涵盖代码生成、创意写作、实用助手等多个生成领域,共计2100个专业人工精心设计的提示词。所有任务均按难度(简单/中等/困难)分级,且数据集完全由专家从零构建。我们还发布了一系列LLM-as-a-Judge评估模型(7B和32B版本),这些模型经过专门训练以实现对生成内容的精细化评估。该方法为模型开发提供了可扩展、高解释性的评估与标注工具,能有效替代成本高昂且精度不足的人工评判。


AutoChemSchematic AI: A Closed-Loop, Physics-Aware Agentic Framework for Auto-Generating Chemical Process and Instrumentation Diagrams

Abstract

arXiv:2505.24584v1 Announce Type: cross Abstract: Recent advancements in generative AI have accelerated the discovery of novel chemicals and materials; however, transitioning these discoveries to industrial-scale production remains a critical bottleneck, as it requires the development of entirely new chemical manufacturing processes. Despite the critical role of process flow diagrams (PFDs) and piping and instrumentation diagrams (PIDs) in scaling chemical processes, current AI methods cannot auto-generate them while adhering to engineering constraints. We present a closed-loop, physics-aware framework for the automated generation of industrially viable PFDs and PIDs. The framework integrates domain-specialized small-scale language models (SLMs) (trained for chemical process QA tasks) with first-principles simulation, leveraging three key components: (1) a hierarchical knowledge graph of process flow and instrumentation descriptions for 1,020+ chemicals, (2) a multi-stage training pipeline that fine-tunes domain-specialized SLMs on synthetic datasets via Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Retrieval-Augmented Instruction Tuning (RAIT), and (3) DWSIM-based simulator-in-the-loop validation to ensure feasibility. To improve both runtime efficiency and model compactness, the framework incorporates advanced inference-time optimizations, including FlashAttention, Lookahead Decoding, PagedAttention with KV-cache quantization, and Test Time Inference Scaling, and independently applies structural pruning techniques (width and depth) guided by importance heuristics to reduce model size with minimal accuracy loss. Experiments demonstrate that the framework generates simulator-validated process descriptions with high fidelity, outperforms baseline methods in correctness, and generalizes to unseen chemicals. By bridging AI-driven design with industrial-scale feasibility, this work significantly reduces R&D timelines from lab discovery to plant deployment.

摘要

生成式人工智能的最新进展加速了新型化学品与材料的发现进程,然而将这些发现转化为工业化规模生产仍存在关键瓶颈,因为这需要开发全新的化学制造工艺。尽管工艺流程图(PFD)和管道仪表流程图(PID)在化工过程放大中具有核心作用,但现有AI方法尚无法在满足工程约束条件下自动生成这些图纸。本研究提出一种闭环、物理感知的自动化框架,用于生成具备工业可行性的PFD与PID。该框架通过三个关键组件实现:(1)包含1,020余种化学品工艺流与仪器描述的分层知识图谱;(2)采用监督微调(SFT)、直接偏好优化(DPO)和检索增强指令调优(RAIT)在合成数据集上微调领域专用小型语言模型(SLM)的多阶段训练流程;(3)基于DWSIM的模拟器闭环验证以确保可行性。为提升运行效率与模型紧凑性,框架整合了FlashAttention、前瞻解码、KV缓存量化的分页注意力及测试时推理缩放等先进推理优化技术,并基于重要性启发式准则独立应用宽度与深度结构剪枝以最小精度损失压缩模型规模。实验表明,该框架生成的过程描述经模拟器验证具有高保真度,在正确性上超越基线方法,并能泛化至未见化学品。通过弥合AI驱动设计与工业级可行性之间的鸿沟,本研究显著缩短了从实验室发现到工厂部署的研发周期。


The Hallucination Dilemma: Factuality-Aware Reinforcement Learning for Large Reasoning Models

Abstract

arXiv:2505.24630v1 Announce Type: cross Abstract: Large language models (LLMs) have significantly advanced in reasoning tasks through reinforcement learning (RL) optimization, achieving impressive capabilities across various challenging benchmarks. However, our empirical analysis reveals a critical drawback: reasoning-oriented RL fine-tuning significantly increases the prevalence of hallucinations. We theoretically analyze the RL training dynamics, identifying high-variance gradient, entropy-induced randomness, and susceptibility to spurious local optima as key factors leading to hallucinations. To address this drawback, we propose Factuality-aware Step-wise Policy Optimization (FSPO), an innovative RL fine-tuning algorithm incorporating explicit factuality verification at each reasoning step. FSPO leverages automated verification against given evidence to dynamically adjust token-level advantage values, incentivizing factual correctness throughout the reasoning process. Experiments across mathematical reasoning and hallucination benchmarks using Qwen2.5 and Llama models demonstrate that FSPO effectively reduces hallucinations while enhancing reasoning accuracy, substantially improving both reliability and performance.

摘要

通过强化学习(RL)优化,大语言模型(LLMs)在推理任务上取得了显著进展,在各种具有挑战性的基准测试中展现出令人印象深刻的能力。然而,我们的实证分析揭示了一个关键缺陷:以推理为导向的RL微调显著增加了幻觉现象的发生率。我们从理论上分析了RL训练动态,发现高方差梯度、熵诱导的随机性以及对伪局部最优的敏感性是导致幻觉的关键因素。为解决这一问题,我们提出了事实感知的逐步策略优化(FSPO),这是一种创新的RL微调算法,其在每个推理步骤中引入了显式的事实性验证。FSPO利用对给定证据的自动验证来动态调整词元级优势值,从而在整个推理过程中激励事实正确性。基于Qwen2.5和Llama模型的数学推理和幻觉基准测试表明,FSPO在提升推理准确性的同时有效减少了幻觉现象,显著改善了模型的可靠性和性能。


Multiple LLM Agents Debate for Equitable Cultural Alignment

Abstract

arXiv:2505.24671v1 Announce Type: cross Abstract: Large Language Models (LLMs) need to adapt their predictions to diverse cultural contexts to benefit diverse communities across the world. While previous efforts have focused on single-LLM, single-turn approaches, we propose to exploit the complementary strengths of multiple LLMs to promote cultural adaptability. We introduce a Multi-Agent Debate framework, where two LLM-based agents debate over a cultural scenario and collaboratively reach a final decision. We propose two variants: one where the LLM agents exclusively debate and another where they dynamically choose between self-reflection and debate during their turns. We evaluate these approaches on 7 open-weight LLMs (and 21 LLM combinations) using the NormAd-ETI benchmark for social etiquette norms in 75 countries. Experiments show that debate improves both overall accuracy and cultural group parity over single-LLM baselines. Notably, multi-agent debate enables relatively small LLMs (7-9B) to achieve accuracy comparable to that of a much larger model (27B parameters).

摘要

大型语言模型(LLMs)需要使其预测适应不同的文化背景,以造福全球多元社群。尽管先前研究主要关注单一LLM的单轮处理方法,我们提出利用多个LLM的互补优势来提升文化适应性。我们引入了一个多智能体辩论框架,其中两个基于LLM的智能体就文化场景展开辩论并协作达成最终决策。我们提出两种变体:一种是LLM智能体仅进行辩论,另一种是它们在轮次中动态选择自省或辩论。我们使用涵盖75个国家社交礼仪规范的NormAd-ETI基准测试,对7个开放权重LLM(及21种LLM组合)进行评估。实验表明,与单LLM基线相比,辩论能同时提升整体准确性和文化群体均衡性。值得注意的是,多智能体辩论使得较小规模LLM(7-9B参数)能达到与更大模型(27B参数)相当的准确度。
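The figure of 21 LLM combinations is consistent with forming every unordered pair from the 7 models, since C(7, 2) = 21. The abstract does not say how the pairs were built, so the sketch below only checks that arithmetic under this assumption; the model names are hypothetical placeholders.

```python
from itertools import combinations

# Hypothetical placeholder names; the abstract does not list the seven models.
models = [f"model_{i}" for i in range(1, 8)]

# Assumption: a "combination" is an unordered pair of two distinct models.
pairs = list(combinations(models, 2))

print(len(pairs))  # number of two-agent debate pairings under this assumption
```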


On Symmetric Losses for Robust Policy Optimization with Noisy Preferences

Abstract

arXiv:2505.24709v1 Announce Type: cross Abstract: Optimizing policies based on human preferences is key to aligning language models with human intent. This work focuses on reward modeling, a core component in reinforcement learning from human feedback (RLHF), and offline preference optimization, such as direct preference optimization. Conventional approaches typically assume accurate annotations. However, real-world preference data often contains noise due to human errors or biases. We propose a principled framework for robust policy optimization under noisy preferences, viewing reward modeling as a classification problem. This allows us to leverage symmetric losses, known for their robustness to label noise in classification, leading to our Symmetric Preference Optimization (SymPO) method. We prove that symmetric losses enable successful policy optimization even under noisy labels, as the resulting reward remains rank-preserving -- a property sufficient for policy improvement. Experiments on synthetic and real-world tasks demonstrate the effectiveness of SymPO.

摘要

基于人类偏好优化策略是将语言模型与人类意图对齐的关键。本研究聚焦于奖励建模——人类反馈强化学习(RLHF)的核心组件,以及离线偏好优化(如直接偏好优化)。传统方法通常假设标注数据准确,然而现实世界的偏好数据常因人为错误或偏差包含噪声。我们提出一个理论框架用于噪声偏好下的鲁棒策略优化,将奖励建模视为分类问题。这使我们能够利用对称损失函数(因其在分类任务中对标签噪声的鲁棒性而闻名),从而发展出对称偏好优化(SymPO)方法。我们证明即使存在噪声标签,对称损失仍能实现成功的策略优化,因为所得奖励保持排序一致性——这一特性足以支撑策略改进。在合成任务和现实任务上的实验验证了SymPO的有效性。
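A loss l is called symmetric when l(z) + l(-z) is the same constant for every margin z. The abstract does not name the losses used in SymPO, so the sketch below is only an illustration of the property: it takes the sigmoid loss as one well-known symmetric loss and contrasts it with the logistic loss, which is not symmetric. Here z stands for the reward margin between the preferred and the rejected response, which is an assumed reading of the classification view.

```python
import math

def sigmoid_loss(z: float) -> float:
    """Sigmoid loss l(z) = 1 / (1 + exp(z)); small when the margin z is large."""
    return 1.0 / (1.0 + math.exp(z))

def logistic_loss(z: float) -> float:
    """Logistic loss l(z) = log(1 + exp(-z)); included as a non-symmetric contrast."""
    return math.log1p(math.exp(-z))

def symmetry_sum(loss, z: float) -> float:
    """Return l(z) + l(-z), which is constant in z exactly when l is symmetric."""
    return loss(z) + loss(-z)

# Illustrative reward margins r(x, y_preferred) - r(x, y_rejected).
margins = [-3.0, -0.5, 0.0, 1.0, 4.0]
sigmoid_sums = [symmetry_sum(sigmoid_loss, z) for z in margins]
logistic_sums = [symmetry_sum(logistic_loss, z) for z in margins]

print(sigmoid_sums)   # stays at the same constant for every margin
print(logistic_sums)  # changes with the margin
```

Flipping a preference label replaces z with -z, so for a symmetric loss the flipped and unflipped losses always sum to the same constant. This is the property that the noisy-label classification literature exploits.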


Causal-aware Large Language Models: Enhancing Decision-Making Through Learning, Adapting and Acting

Abstract

arXiv:2505.24710v1 Announce Type: cross Abstract: Large language models (LLMs) have shown great potential in decision-making due to the vast amount of knowledge stored within the models. However, these pre-trained models often lack reasoning abilities and are difficult to adapt to new environments, further hindering their application to complex real-world tasks. To address these challenges, inspired by the human cognitive process, we propose Causal-aware LLMs, which integrate the structural causal model (SCM) into the decision-making process to model, update, and utilize structured knowledge of the environment in a "learning-adapting-acting" paradigm. Specifically, in the learning stage, we first utilize an LLM to extract the environment-specific causal entities and their causal relations to initialize a structured causal model of the environment. Subsequently, in the adapting stage, we update the structured causal model through external feedback about the environment, using the idea of causal intervention. Finally, in the acting stage, Causal-aware LLMs exploit structured causal knowledge for more efficient policy-making through the reinforcement learning agent. The above processes are performed iteratively to learn causal knowledge, ultimately enabling the causal-aware LLMs to achieve a more accurate understanding of the environment and make more efficient decisions. Experimental results across 22 diverse tasks within the open-world game "Crafter" validate the effectiveness of our proposed method.

摘要

大型语言模型(LLMs)因其内部存储的海量知识,在决策领域展现出巨大潜力。然而这些预训练模型往往缺乏推理能力,且难以适应新环境,这进一步阻碍了其在复杂现实任务中的应用。为解决这些问题,受人类认知过程启发,我们提出因果感知型LLMs,通过将结构因果模型(SCM)整合到决策过程中,以"学习-适应-行动"范式对环境的结构化知识进行建模、更新和利用。具体而言:在学习阶段,首先利用LLM提取环境特异性因果实体及其关系,初始化环境的结构因果模型;在适应阶段,通过因果干预思想,基于环境的外部反馈更新结构化因果模型;在行动阶段,因果感知型LLMs借助强化学习智能体,利用结构化因果知识实现更高效的策略制定。上述过程迭代执行以学习因果知识,最终使因果感知型LLMs能更精准理解环境并做出高效决策。在开放世界游戏"Crafter"中开展的22项多样化任务实验验证了本方法的有效性。


Decoding Knowledge Attribution in Mixture-of-Experts: A Framework of Basic-Refinement Collaboration and Efficiency Analysis

Abstract

arXiv:2505.24593v1 Announce Type: cross Abstract: The interpretability of Mixture-of-Experts (MoE) models, especially those with heterogeneous designs, remains underexplored. Existing attribution methods for dense models fail to capture dynamic routing-expert interactions in sparse MoE architectures. To address this issue, we propose a cross-level attribution algorithm to analyze sparse MoE architectures (Qwen 1.5-MoE, OLMoE, Mixtral-8x7B) against dense models (Qwen 1.5-7B, Llama-7B, Mistral-7B). Results show MoE models achieve 37% higher per-layer efficiency via a "mid-activation, late-amplification" pattern: early layers screen experts, while late layers refine knowledge collaboratively. Ablation studies reveal a "basic-refinement" framework--shared experts handle general tasks (entity recognition), while routed experts specialize in domain-specific processing (geographic attributes). Semantic-driven routing is evidenced by strong correlations between attention heads and experts (r=0.68), enabling task-aware coordination. Notably, architectural depth dictates robustness: deep Qwen 1.5-MoE mitigates expert failures (e.g., 43% MRR drop in geographic tasks when blocking top-10 experts) through shared expert redundancy, whereas shallow OLMoE suffers severe degradation (76% drop). Task sensitivity further guides design: core-sensitive tasks (geography) require concentrated expertise, while distributed-tolerant tasks (object attributes) leverage broader participation. These insights advance MoE interpretability, offering principles to balance efficiency, specialization, and robustness.

摘要

混合专家模型(MoE)的可解释性,尤其是异构设计模型,目前仍缺乏深入研究。现有针对稠密模型的归因方法无法捕捉稀疏MoE架构中动态路由与专家的交互机制。为此,我们提出跨层级归因算法,对比分析稀疏MoE架构(Qwen 1.5-MoE、OLMoE、Mixtral-8x7B)与稠密模型(Qwen 1.5-7B、Llama-7B、Mixtral-7B)。结果表明,MoE模型通过"中期激活-后期放大"模式实现单层效率提升37%:早期层筛选专家,后期层协同精炼知识。消融实验揭示"基础-精炼"框架——共享专家处理通用任务(实体识别),路由专家专注领域特定处理(地理属性)。注意力头与专家的强相关性(r=0.68)证实语义驱动路由机制,实现任务感知协同。值得注意的是,架构深度决定鲁棒性:深层Qwen 1.5-MoE通过共享专家冗余缓解专家失效(如阻断前10专家时地理任务MRR下降43%),而浅层OLMoE性能急剧退化(下降76%)。任务敏感性进一步指导设计:核心敏感任务(地理)需要专家集中处理,分布容忍任务(对象属性)适合广泛参与。这些发现推进了MoE可解释性研究,为平衡效率、专业化和鲁棒性提供设计原则。


Multi-Domain ABSA Conversation Dataset Generation via LLMs for Real-World Evaluation and Model Comparison

Abstract

arXiv:2505.24701v1 Announce Type: cross Abstract: Aspect-Based Sentiment Analysis (ABSA) offers granular insights into opinions but often suffers from the scarcity of diverse, labeled datasets that reflect real-world conversational nuances. This paper presents an approach for generating synthetic ABSA data using Large Language Models (LLMs) to address this gap. We detail the generation process aimed at producing data with consistent topic and sentiment distributions across multiple domains using GPT-4o. The quality and utility of the generated data were evaluated by assessing the performance of three state-of-the-art LLMs (Gemini 1.5 Pro, Claude 3.5 Sonnet, and DeepSeek-R1) on topic and sentiment classification tasks. Our results demonstrate the effectiveness of the synthetic data, revealing distinct performance trade-offs among the models: DeepSeek-R1 showed higher precision, Gemini 1.5 Pro and Claude 3.5 Sonnet exhibited strong recall, and Gemini 1.5 Pro offered significantly faster inference. We conclude that LLM-based synthetic data generation is a viable and flexible method for creating valuable ABSA resources, facilitating research and model evaluation without reliance on limited or inaccessible real-world labeled data.

摘要

基于方面的情感分析(ABSA)能够提供细粒度的意见洞察,但通常面临缺乏反映真实世界对话多样性的标注数据集的问题。本文提出了一种利用大语言模型(LLMs)生成合成ABSA数据的方法以弥补这一不足。我们详细阐述了使用GPT-4o生成具有跨领域一致主题和情感分布的数据流程,并通过评估三种前沿LLM模型(Gemini 1.5 Pro、Claude 3.5 Sonnet和DeepSeek-R1)在主题与情感分类任务上的表现,验证了生成数据的质量与实用性。实验结果表明:DeepSeek-R1展现出更高精确度,Gemini 1.5 Pro与Claude 3.5 Sonnet具有较强召回率,而Gemini 1.5 Pro推理速度显著更快。本研究证实基于LLM的合成数据生成是一种可行且灵活的方法,能够为ABSA研究创建有价值的资源,在不依赖有限或难以获取的真实标注数据的情况下促进模型评估与研究进展。


A survey of using EHR as real-world evidence for discovering and validating new drug indications

Abstract

arXiv:2505.24767v1 Announce Type: cross Abstract: Electronic Health Records (EHRs) have been increasingly used as real-world evidence (RWE) to support the discovery and validation of new drug indications. This paper surveys current approaches to EHR-based drug repurposing, covering data sources, processing methodologies, and representation techniques. It discusses study designs and statistical frameworks for evaluating drug efficacy. Key challenges in validation are discussed, with emphasis on the role of large language models (LLMs) and target trial emulation. By synthesizing recent developments and methodological advances, this work provides a foundational resource for researchers aiming to translate real-world data into actionable drug-repurposing evidence.

摘要

电子健康记录(EHRs)作为真实世界证据(RWE)正日益广泛地用于支持新药适应症的发现与验证。本文系统综述了基于EHRs的药物重定位现有方法,涵盖数据来源、处理流程及表征技术,探讨了评估药物疗效的研究设计与统计框架,重点分析了验证过程中的核心挑战,特别是大语言模型(LLMs)和目标试验模拟的作用。通过整合最新研究进展与方法学创新,本研究为研究者将真实世界数据转化为可操作的药物重定位证据提供了基础性资源。


Drop Dropout on Single-Epoch Language Model Pretraining

Abstract

arXiv:2505.24788v1 Announce Type: cross Abstract: Originally, dropout was seen as a breakthrough regularization technique that improved performance in almost all applications of deep learning by reducing overfitting. Yet, single-epoch pretraining tasks common to modern LLMs yield minimal overfitting, leading to dropout not being used for large LLMs. Nevertheless, no thorough empirical investigation has been done on the role of dropout in LM pretraining. Through experiments in single-epoch pretraining of both masked (BERT) and autoregressive (Pythia 160M and 1.4B) LMs with varying levels of dropout, we find that downstream performance in language modeling, morpho-syntax (BLiMP), question answering (SQuAD), and natural-language inference (MNLI) improves when dropout is not applied during pretraining. We additionally find that the recently introduced "early dropout" also degrades performance relative to applying no dropout at all. We further investigate the models' editability, and find that models trained without dropout are more successful in gradient-based model editing (MEND) and equivalent in representation-based model editing (ReFT). Therefore, we advocate dropping dropout during single-epoch pretraining.

摘要

最初,dropout被视为一种突破性的正则化技术,它通过减少过拟合在几乎所有深度学习应用中提升了模型性能。然而,现代大语言模型(LLM)常用的单周期预训练任务本身过拟合程度极低,导致dropout未被用于大型LLM。目前尚未有研究系统探讨dropout在语言模型预训练中的作用。我们通过在掩码式(BERT)和自回归式(Pythia 160M和1.4B)语言模型的单周期预训练中控制dropout率进行实验,发现当预训练阶段不使用dropout时,模型在下游任务(语言建模、形态句法BLiMP、问答SQuAD、自然语言推理MNLI)中表现更优。实验还表明,最新提出的"早期dropout"方案相比完全不使用dropout同样会降低性能。我们进一步研究了模型的可编辑性,发现未使用dropout训练的模型在基于梯度的模型编辑(MEND)中表现更佳,在基于表征的模型编辑(ReFT)中表现相当。因此我们建议在单周期预训练中放弃使用dropout。


Vision LLMs Are Bad at Hierarchical Visual Understanding, and LLMs Are the Bottleneck

Abstract

arXiv:2505.24840v1 Announce Type: cross Abstract: This paper reveals that many state-of-the-art large language models (LLMs) lack hierarchical knowledge about our visual world and are unaware of even well-established biology taxonomies. This shortcoming makes LLMs a bottleneck for vision LLMs' hierarchical visual understanding (e.g., recognizing Anemone Fish but not Vertebrate). We arrive at these findings using about one million four-choice visual question answering (VQA) tasks constructed from six taxonomies and four image datasets. Interestingly, finetuning a vision LLM using our VQA tasks reaffirms LLMs' bottleneck effect to some extent because the VQA tasks improve the LLM's hierarchical consistency more than the vision LLM's. We conjecture that one cannot make vision LLMs understand visual concepts in a fully hierarchical manner until LLMs possess corresponding taxonomy knowledge.

摘要

本文揭示了许多前沿大型语言模型(LLMs)缺乏对视觉世界的层级化知识,甚至对生物学中成熟的分类体系也一无所知。这一缺陷使LLMs成为视觉大模型层级化视觉理解(例如能识别海葵鱼却无法识别脊椎动物)的瓶颈。我们通过构建约百万道四选一视觉问答(VQA)任务得出该结论,这些任务基于六个分类体系和四个图像数据集。有趣的是,使用我们的VQA任务微调视觉大模型时,LLMs的瓶颈效应在一定程度上得到印证——因为VQA任务对LLMs层级一致性的提升效果优于其对视觉大模型的改进。我们推测,除非LLMs本身掌握相应的分类学知识,否则无法使视觉大模型获得完全层级化的视觉概念理解。


PhySense: Principle-Based Physics Reasoning Benchmarking for Large Language Models

Abstract

arXiv:2505.24823v1 Announce Type: cross Abstract: Large language models (LLMs) have rapidly advanced and are increasingly capable of tackling complex scientific problems, including those in physics. Despite this progress, current LLMs often fail to emulate the concise, principle-based reasoning characteristic of human experts, instead generating lengthy and opaque solutions. This discrepancy highlights a crucial gap in their ability to apply core physical principles for efficient and interpretable problem solving. To systematically investigate this limitation, we introduce PhySense, a novel principle-based physics reasoning benchmark designed to be easily solvable by experts using guiding principles, yet deceptively difficult for LLMs without principle-first reasoning. Our evaluation across multiple state-of-the-art LLMs and prompt types reveals a consistent failure to align with expert-like reasoning paths, providing insights for developing AI systems with efficient, robust and interpretable principle-based scientific reasoning.

摘要

大型语言模型(LLMs)发展迅速,日益具备解决复杂科学问题的能力,包括物理学领域的问题。然而,尽管取得了这些进展,当前的LLMs往往无法模拟人类专家简洁、基于原理的推理特征,反而生成冗长且不透明的解决方案。这种差异突显了它们在运用核心物理原理进行高效且可解释问题求解方面的重要缺陷。为了系统地研究这一局限性,我们提出了PhySense——一个基于原理的物理推理基准测试,该测试设计为专家可轻松运用指导原则解决,但对不具备原理优先推理能力的LLMs却具有欺骗性的难度。通过对多种最先进LLMs及提示类型的评估,我们发现它们始终无法与类专家的推理路径保持一致,这为开发具有高效、稳健且可解释的基于原理的科学推理AI系统提供了重要启示。


VideoCAD: A Large-Scale Video Dataset for Learning UI Interactions and 3D Reasoning from CAD Software

Abstract

arXiv:2505.24838v1 Announce Type: cross Abstract: Computer-Aided Design (CAD) is a time-consuming and complex process, requiring precise, long-horizon user interactions with intricate 3D interfaces. While recent advances in AI-driven user interface (UI) agents show promise, most existing datasets and methods focus on short, low-complexity tasks in mobile or web applications, failing to capture the demands of professional engineering tools. In this work, we introduce VideoCAD, the first attempt at engineering UI interaction learning for precision tasks. Specifically, VideoCAD is a large-scale synthetic dataset consisting of over 41K annotated video recordings of CAD operations, generated using an automated framework for collecting high-fidelity UI action data from human-made CAD designs. Compared to existing datasets, VideoCAD offers an order of magnitude higher complexity in UI interaction learning for real-world engineering tasks, having up to a 20x longer time horizon than other datasets. We show two important downstream applications of VideoCAD: learning UI interactions from professional precision 3D CAD tools and a visual question-answering (VQA) benchmark designed to evaluate multimodal large language models' (LLM) spatial reasoning and video understanding abilities. To learn the UI interactions, we propose VideoCADFormer - a state-of-the-art model in learning CAD interactions directly from video, which outperforms multiple behavior cloning baselines. Both VideoCADFormer and the VQA benchmark derived from VideoCAD reveal key challenges in the current state of video-based UI understanding, including the need for precise action grounding, multi-modal and spatial reasoning, and long-horizon dependencies.

摘要

计算机辅助设计(CAD)是一个耗时且复杂的过程,需要用户通过精密的三维界面进行长时间、高精度的交互操作。尽管当前人工智能驱动的用户界面(UI)代理技术展现出潜力,但现有数据集和方法大多聚焦于移动或网页应用中的简短低复杂度任务,无法满足专业工程工具的需求。本研究首次提出面向精密任务的工程化UI交互学习框架VideoCAD,该大规模合成数据集包含41,000余条带标注的CAD操作视频记录,通过自动化框架从人工CAD设计中采集高保真UI动作数据生成。相较于现有数据集,VideoCAD在真实工程任务的UI交互学习复杂度上提升了一个数量级,其时间跨度可达其他数据集的20倍。我们展示了VideoCAD的两大核心应用:基于专业精密三维CAD工具的UI交互学习,以及用于评估多模态大语言模型(LLM)空间推理与视频理解能力的视觉问答(VQA)基准测试。针对UI交互学习,我们提出当前最先进的VideoCADFormer模型——该模型直接从视频学习CAD交互,其性能超越多种行为克隆基线方法。VideoCADFormer及基于VideoCAD构建的VQA基准共同揭示了当前视频化UI理解领域的关键挑战,包括精确动作定位、多模态空间推理以及长时依赖关系等核心问题。


ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

Abstract

arXiv:2505.24864v1 Announce Type: cross Abstract: Recent advances in reasoning-centric language models have highlighted reinforcement learning (RL) as a promising method for aligning models with verifiable rewards. However, it remains contentious whether RL truly expands a model's reasoning capabilities or merely amplifies high-reward outputs already latent in the base model's distribution, and whether continually scaling up RL compute reliably leads to improved reasoning performance. In this work, we challenge prevailing assumptions by demonstrating that prolonged RL (ProRL) training can uncover novel reasoning strategies that are inaccessible to base models, even under extensive sampling. We introduce ProRL, a novel training methodology that incorporates KL divergence control, reference policy resetting, and a diverse suite of tasks. Our empirical analysis reveals that RL-trained models consistently outperform base models across a wide range of pass@k evaluations, including scenarios where base models fail entirely regardless of the number of attempts. We further show that reasoning boundary improvements correlate strongly with the task competence of the base model and with training duration, suggesting that RL can explore and populate new regions of solution space over time. These findings offer new insights into the conditions under which RL meaningfully expands reasoning boundaries in language models and establish a foundation for future work on long-horizon RL for reasoning. We release model weights to support further research: https://huggingface.co/nvidia/Nemotron-Research-Reasoning-Qwen-1.5B

摘要

以推理为核心的语言模型最新进展表明,强化学习(RL)是实现模型与可验证奖励对齐的有效方法。然而,学界仍存在争议:RL究竟是真正扩展了模型的推理能力,还是仅放大了基础模型分布中已有的高奖励输出?持续增加RL计算量是否能可靠提升推理性能?本研究通过证明延长RL训练(ProRL)能够发现基础模型即使经过大量采样也无法获得的新型推理策略,对主流假设提出了挑战。我们提出ProRL这一创新训练方法,整合了KL散度控制、参考策略重置和多样化任务集。实证分析表明,经过RL训练的模型在各类pass@k评估中始终优于基础模型,包括基础模型无论尝试次数多少均完全失败的场景。我们进一步揭示,推理边界的改善程度与基础模型的任务胜任力及训练时长呈强相关性,这表明RL能够随时间推移探索并填充解空间的新区域。这些发现为理解RL在何种条件下能实质性扩展语言模型推理边界提供了新视角,并为未来长周期推理强化学习研究奠定了基础。我们公开模型权重以支持后续研究:https://huggingface.co/nvidia/Nemotron-Research-Reasoning-Qwen-1.5B
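The pass@k evaluations mentioned above can be made concrete with the unbiased estimator that is widely used in code-generation evaluation: draw n samples per problem, count the c correct ones, and compute 1 - C(n-c, k) / C(n, k). The abstract does not state which estimator the authors use, so this is a sketch under that assumption, and the sample counts are made up.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c are correct (requires k <= n)."""
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# A problem the model never solves stays at zero for every k ...
never_solved = [pass_at_k(256, 0, k) for k in (1, 16, 128, 256)]

# ... while a single correct sample out of 256 already lifts pass@k as k grows.
rarely_solved = [pass_at_k(256, 1, k) for k in (1, 16, 128, 256)]

print(never_solved)
print(rarely_solved)
```

The first case corresponds to the scenario in the abstract where a base model fails regardless of the number of attempts: with c = 0, no value of k raises the estimate above zero.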


Improving Reliability and Explainability of Medical Question Answering through Atomic Fact Checking in Retrieval-Augmented LLMs

Abstract

arXiv:2505.24830v1 Announce Type: cross Abstract: Large language models (LLMs) exhibit extensive medical knowledge but are prone to hallucinations and inaccurate citations, which pose a challenge to their clinical adoption and regulatory compliance. Current methods, such as Retrieval Augmented Generation, partially address these issues by grounding answers in source documents, but hallucinations and low fact-level explainability persist. In this work, we introduce a novel atomic fact-checking framework designed to enhance the reliability and explainability of LLMs used in medical long-form question answering. This method decomposes LLM-generated responses into discrete, verifiable units called atomic facts, each of which is independently verified against an authoritative knowledge base of medical guidelines. This approach enables targeted correction of errors and direct tracing to source literature, thereby improving the factual accuracy and explainability of medical Q&A. Extensive evaluation using multi-reader assessments by medical experts and an automated open Q&A benchmark demonstrated significant improvements in factual accuracy and explainability. Our framework achieved up to a 40% overall answer improvement and a 50% hallucination detection rate. The ability to trace each atomic fact back to the most relevant chunks from the database provides a granular, transparent explanation of the generated responses, addressing a major gap in current medical AI applications. This work represents a crucial step towards more trustworthy and reliable clinical applications of LLMs, addressing key prerequisites for clinical application and fostering greater confidence in AI-assisted healthcare.

摘要

大型语言模型(LLMs)展现出广泛的医学知识,但存在幻觉和错误引用问题,这对其临床应用和法规合规构成挑战。当前方法如检索增强生成通过将答案基于源文档部分解决了这些问题,但幻觉和事实层面可解释性低的问题仍然存在。本研究提出了一种新颖的原子事实核查框架,旨在提升医学长问答中LLMs的可靠性和可解释性。该方法将LLM生成的回答分解为离散、可验证的单元(称为原子事实),每个单元均独立对照权威医学指南知识库进行验证。此方法能针对性纠正错误并直接溯源至文献,从而提升医学问答的事实准确性和可解释性。通过医学专家多读者评估和自动化开放问答基准测试的广泛验证表明,该方法在事实准确性和可解释性上均有显著提升。我们的框架实现了高达40%的整体答案改进和50%的幻觉检测率。将每个原子事实溯源至数据库中最相关段落的能力,为生成回答提供了细粒度、透明的解释,弥补了当前医学AI应用的主要缺陷。这项工作代表了向更可信、可靠的LLM临床应用迈出的关键一步,解决了临床应用的关键前提条件,并为AI辅助医疗增强了信心。


ProxyThinker: Test-Time Guidance through Small Visual Reasoners

Abstract

arXiv:2505.24872v1 Announce Type: cross Abstract: Recent advancements in reinforcement learning with verifiable rewards have pushed the boundaries of the visual reasoning capabilities in large vision-language models (LVLMs). However, training LVLMs with reinforcement fine-tuning (RFT) is computationally expensive, posing a significant challenge to scaling model size. In this work, we propose ProxyThinker, an inference-time technique that enables large models to inherit the visual reasoning capabilities from small, slow-thinking visual reasoners without any training. By subtracting the output distributions of base models from those of RFT reasoners, ProxyThinker modifies the decoding dynamics and successfully elicits slow-thinking reasoning, as demonstrated by emergent sophisticated behaviors such as self-verification and self-correction. ProxyThinker consistently boosts performance on challenging visual benchmarks on spatial, mathematical, and multi-disciplinary reasoning, enabling untuned base models to compete with the performance of their full-scale RFT counterparts. Furthermore, our implementation efficiently coordinates multiple language models with parallelism techniques and achieves up to 38× faster inference compared to previous decoding-time methods, paving the way for the practical deployment of ProxyThinker. Code is available at https://github.com/MrZilinXiao/ProxyThinker.

摘要

具有可验证奖励的强化学习最新进展,推动了大型视觉语言模型(LVLMs)在视觉推理能力方面的边界。然而,通过强化微调(RFT)训练LVLMs计算成本高昂,对模型规模的扩展构成了重大挑战。在本研究中,我们提出了ProxyThinker,一种推理时技术,使大型模型无需任何训练即可继承小型、慢思考视觉推理器的视觉推理能力。通过从RFT推理器的输出分布中减去基础模型的输出分布,ProxyThinker修改了解码动态,并成功引发了慢思考推理,表现为自我验证和自我纠正等复杂行为的涌现。ProxyThinker在空间、数学和多学科推理等具有挑战性的视觉基准测试中持续提升性能,使未经调优的基础模型能够与其全规模RFT对应模型的性能相媲美。此外,我们的实现通过并行技术高效协调多个语言模型,与之前的解码时方法相比,推理速度提升高达38倍,为ProxyThinker的实际部署铺平了道路。
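The subtraction described above can be illustrated on toy next-token logits. The abstract only says that the base model's output distribution is subtracted from the RFT reasoner's. How that difference is applied to the large model is not stated, so adding it in logit space with a weight `alpha` is an assumption of this sketch, and all names and numbers are hypothetical.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def guided_logits(large_base, small_rft, small_base, alpha=1.0):
    """Shift the large model's logits by the (RFT - base) difference of the small pair."""
    return [lb + alpha * (sr - sb)
            for lb, sr, sb in zip(large_base, small_rft, small_base)]

# Made-up next-token logits over a 3-token vocabulary.
large_base = [2.0, 1.0, 0.0]   # large model without reinforcement fine-tuning
small_base = [1.5, 1.0, 0.0]   # small model before RFT
small_rft = [0.5, 2.5, 0.0]    # the same small model after RFT

combined = guided_logits(large_base, small_rft, small_base)
probs = softmax(combined)

print(combined)
print(probs.index(max(probs)))  # index of the token preferred after guidance
```

In this toy case the large model alone prefers token 0, while the guided logits prefer token 1, the token the small RFT reasoner learned to favor. No weights of the large model are changed.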


Compress then Serve: Serving Thousands of LoRA Adapters with Little Overhead

Abstract

arXiv:2407.00066v4 Announce Type: replace Abstract: Fine-tuning large language models (LLMs) with low-rank adaptations (LoRAs) has become common practice, often yielding numerous copies of the same LLM differing only in their LoRA updates. This paradigm presents challenges for systems that serve real-time responses to queries that each involve a different LoRA. Prior works optimize the design of such systems but still require continuous loading and offloading of LoRAs, as it is infeasible to store thousands of LoRAs in GPU memory. To mitigate this issue, we investigate the efficacy of compression when serving LoRAs. We propose a method for the joint compression of LoRAs into a shared basis paired with LoRA-specific scaling matrices. We extend our algorithm to learn clusters of LoRAs that are amenable to joint compression, allowing it to scale gracefully to large LoRA collections. Our experiments with up to 1000 LoRAs demonstrate that compressed LoRAs preserve performance while offering major throughput gains in realistic serving scenarios with over a thousand LoRAs, maintaining 80% of the throughput of serving a single LoRA.

摘要

摘要:采用低秩自适应(LoRA)对大型语言模型(LLM)进行微调已成为常见实践,这通常会产生仅在LoRA更新部分存在差异的多个LLM副本。该范式对实时响应查询的服务系统提出了挑战,尤其是当每个查询涉及不同LoRA时。现有研究虽优化了此类系统的设计,但仍需持续加载和卸载LoRA,因为将数千个LoRA存储在GPU内存中并不可行。为缓解此问题,我们研究了LoRA服务场景中压缩技术的有效性。我们提出了一种将多个LoRA联合压缩至共享基空间的方法,并配以LoRA特定的缩放矩阵。通过扩展算法学习适合联合压缩的LoRA聚类,使其能够优雅地扩展到大规模LoRA集合。在包含多达1000个LoRA的实验表明,压缩后的LoRA在保持性能的同时,可在实际服务场景(涉及上千个LoRA)中实现显著吞吐量提升——维持单个LoRA服务吞吐量的80%。


Harnessing Negative Signals: Reinforcement Distillation from Teacher Data for LLM Reasoning

Abstract

arXiv:2505.24850v1 Announce Type: cross Abstract: Recent advances in model distillation demonstrate that data from advanced reasoning models (e.g., DeepSeek-R1, OpenAI's o1) can effectively transfer complex reasoning abilities to smaller, efficient student models. However, standard practices employ rejection sampling, discarding incorrect reasoning examples -- valuable, yet often underutilized data. This paper addresses the critical question: How can both positive and negative distilled reasoning traces be effectively leveraged to maximize LLM reasoning performance in an offline setting? To this end, we propose Reinforcement Distillation (REDI), a two-stage framework. Stage 1 learns from positive traces via Supervised Fine-Tuning (SFT). Stage 2 further refines the model using both positive and negative traces through our proposed REDI objective. This novel objective is a simple, reference-free loss function that outperforms established methods like DPO and SimPO in this distillation context. Our empirical evaluations demonstrate REDI's superiority over baseline Rejection Sampling SFT or SFT combined with DPO/SimPO on mathematical reasoning tasks. Notably, the Qwen-REDI-1.5B model, post-trained on just 131k positive and negative examples from the open Open-R1 dataset, achieves an 83.1% score on MATH-500 (pass@1). Its performance matches or surpasses that of DeepSeek-R1-Distill-Qwen-1.5B (a model post-trained on 800k proprietary data) across various mathematical reasoning benchmarks, establishing a new state-of-the-art for 1.5B models post-trained offline with openly available data.

摘要

模型蒸馏领域的最新进展表明,来自高级推理模型(如DeepSeek-R1、OpenAI的o1)的数据能有效将复杂推理能力迁移至更小规模的高效学生模型。然而,标准实践采用拒绝采样策略,直接丢弃错误推理样本——这些具有价值却常被忽视的数据。本文探讨核心问题:在离线环境下,如何同时利用正向与负向蒸馏推理轨迹以最大化大语言模型的推理性能?为此,我们提出强化蒸馏(REDI)框架,包含两个阶段:第一阶段通过监督微调(SFT)从正向轨迹学习;第二阶段采用我们提出的REDI目标函数,结合正负样本进一步优化模型。该创新目标函数是一种简洁的无参考损失函数,在蒸馏场景中优于DPO、SimPO等现有方法。实证评估表明,在数学推理任务上,REDI显著超越基于拒绝采样的SFT基线及结合DPO/SimPO的SFT方法。值得注意的是,Qwen-REDI-1.5B模型仅使用开放数据集Open-R1中的13.1万条正负样本进行后训练,即在MATH-500测试集(pass@1)达到83.1%准确率。在多项数学推理基准测试中,其性能持平或超越基于80万条专有数据后训练的DeepSeek-R1-Distill-Qwen-1.5B模型,为基于公开数据的1.5B规模离线后训练模型确立了新标杆。


HELM: Hyperbolic Large Language Models via Mixture-of-Curvature Experts

Abstract

arXiv:2505.24722v1 Announce Type: cross Abstract: Large language models (LLMs) have shown great success in text modeling tasks across domains. However, natural language exhibits inherent semantic hierarchies and nuanced geometric structure, which current LLMs do not capture completely owing to their reliance on Euclidean operations. Recent studies have also shown that not respecting the geometry of token embeddings leads to training instabilities and degradation of generative capabilities. These findings suggest that shifting to non-Euclidean geometries can better align language models with the underlying geometry of text. We therefore propose to operate fully in hyperbolic space, known for its expansive, scale-free, and low-distortion properties. To this end, we introduce HELM, a family of HypErbolic Large Language Models, offering a geometric rethinking of the Transformer-based LLM that addresses the representational inflexibility, missing set of necessary operations, and poor scalability of existing hyperbolic LMs. We additionally introduce a Mixture-of-Curvature Experts model, HELM-MICE, where each expert operates in a distinct curvature space to encode more fine-grained geometric structure from text, as well as a dense model, HELM-D. For HELM-MICE, we further develop hyperbolic Multi-Head Latent Attention (HMLA) for efficient, reduced-KV-cache training and inference. For both models, we develop essential hyperbolic equivalents of rotary positional encodings and RMS normalization. We are the first to train fully hyperbolic LLMs at billion-parameter scale, and evaluate them on well-known benchmarks such as MMLU and ARC, spanning STEM problem-solving, general knowledge, and commonsense reasoning. Our results show consistent gains from our HELM architectures -- up to 4% -- over popular Euclidean architectures used in LLaMA and DeepSeek, highlighting the efficacy and enhanced reasoning afforded by hyperbolic geometry in large-scale LM pretraining.

摘要

大型语言模型(LLMs)在跨领域文本建模任务中展现出卓越成效。然而自然语言具有固有的语义层级结构和微妙的几何特性,当前LLMs因依赖欧几里得运算而未能完全捕捉这些特征。近期研究还表明,忽视词嵌入的几何特性会导致训练不稳定和生成能力退化。这些发现提示,转向非欧几里得几何能更有效地使语言模型与文本的底层几何结构对齐。为此,我们提出在全双曲空间中进行建模——该空间以具备扩展性、无标度特性和低失真特性著称。我们由此推出HELM(双曲大型语言模型)系列,对基于Transformer的LLMs进行几何重构,解决了现有双曲语言模型存在的表征僵化、必要运算缺失和可扩展性不足等问题。我们还提出混合曲率专家模型HELM-MICE(各专家在特定曲率空间运作以编码更细粒度的文本几何结构)以及稠密模型HELM-D。针对HELM-MICE,我们进一步开发了双曲多头潜在注意力机制(HMLA),实现高效的低KV缓存训练与推理。针对两种模型,我们均开发了关键的双曲等价模块:旋转位置编码和RMS归一化。我们率先实现了十亿参数规模的全双曲LLMs训练,并在MMLU、ARC等涵盖STEM问题求解、通识知识和常识推理的基准测试中进行评估。实验结果表明:相较于LLaMA和DeepSeek采用的欧几里得架构,我们的HELM架构能持续获得最高达4%的性能提升,这印证了双曲几何在大规模语言模型预训练中具有提升推理效能的优势。


Combining Domain and Alignment Vectors to Achieve Better Knowledge-Safety Trade-offs in LLMs

Abstract

arXiv:2411.06824v2 Announce Type: replace Abstract: There is a growing interest in training domain-expert LLMs that excel in specific technical fields compared to their general-purpose instruction-tuned counterparts. However, these expert models often experience a loss in their safety abilities in the process, making them capable of generating harmful content. As a solution, we introduce an efficient and effective merging-based alignment method called MergeAlign that interpolates the domain and alignment vectors, creating safer domain-specific models while preserving their utility. We apply MergeAlign on Llama3 variants that are experts in medicine and finance, obtaining substantial alignment improvements with minimal to no degradation on domain-specific benchmarks. We study the impact of model merging through model similarity metrics and contributions of individual models being merged. We hope our findings open new research avenues and inspire more efficient development of safe expert LLMs.

摘要

与通用指令调优模型相比,训练特定技术领域的专家级大语言模型(LLM)正受到越来越多的关注。然而,这些专家模型在训练过程中往往会出现安全能力下降的问题,可能生成有害内容。为此,我们提出了一种高效且基于模型融合的对齐方法——MergeAlign,该方法通过插值领域向量与对齐向量,在保持模型实用性的同时创建更安全的领域专用模型。我们在医学和金融领域的Llama3变体上应用MergeAlign,在领域专用基准测试中几乎不损失性能的情况下显著提升了模型安全性。我们通过模型相似性度量及融合模型的个体贡献度分析了模型融合的影响。希望本研究能为安全专家型LLM的高效开发开辟新研究方向并提供启发。
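
The interpolation of domain and alignment vectors can be illustrated with plain task-vector arithmetic. This is a minimal sketch, not the authors' implementation: it assumes a "vector" is the per-parameter difference between a fine-tuned model and a shared base model, and that the two vectors are mixed with a single weight `alpha`. The function names, the `alpha` parameter, and the toy weights are all hypothetical.

```python
def task_vector(finetuned, base):
    """Per-parameter difference between a fine-tuned model and its base."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }

def merge_align(base, domain_model, aligned_model, alpha=0.5):
    """Add an interpolation of the domain and alignment vectors to the base.

    alpha is a hypothetical mixing weight: 1.0 keeps only the domain
    vector, 0.0 keeps only the alignment vector.
    """
    domain_vec = task_vector(domain_model, base)
    align_vec = task_vector(aligned_model, base)
    merged = {}
    for name in base:
        merged[name] = [
            b + alpha * d + (1.0 - alpha) * a
            for b, d, a in zip(base[name], domain_vec[name], align_vec[name])
        ]
    return merged

# Toy "models": one parameter tensor each, flattened to a list of floats.
base = {"w": [0.0, 0.0]}
domain_model = {"w": [2.0, 2.0]}   # stand-in for a medicine or finance expert
aligned_model = {"w": [4.0, 4.0]}  # stand-in for a safety-aligned model
merged = merge_align(base, domain_model, aligned_model, alpha=0.5)
```

Sweeping `alpha` between 0 and 1 in a sketch like this is one simple way to explore the knowledge-safety trade-off the abstract refers to.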


QPO: Query-dependent Prompt Optimization via Multi-Loop Offline Reinforcement Learning

Abstract

arXiv:2408.10504v2 Announce Type: replace Abstract: Prompt engineering has demonstrated remarkable success in enhancing the performance of large language models (LLMs) across diverse tasks. However, most existing prompt optimization methods only focus on the task-level performance, overlooking the importance of query-preferred prompts, which leads to suboptimal performances. Additionally, these methods rely heavily on frequent interactions with LLMs to obtain feedback for guiding the optimization process, incurring substantial redundant interaction costs. In this paper, we introduce Query-dependent Prompt Optimization (QPO), which leverages multi-loop offline reinforcement learning to iteratively fine-tune a small pretrained language model to generate optimal prompts tailored to the input queries, thus significantly improving the prompting effect on the large target LLM. We derive insights from offline prompting demonstration data, which already exists in large quantities as a by-product of benchmarking diverse prompts on open-sourced tasks, thereby circumventing the expenses of online interactions. Furthermore, we continuously augment the offline dataset with the generated prompts in each loop, as the prompts from the fine-tuned model are supposed to outperform the source prompts in the original dataset. These iterative loops bootstrap the model towards generating optimal prompts. Experiments on various LLM scales and diverse NLP and math tasks demonstrate the efficacy and cost-efficiency of our method in both zero-shot and few-shot scenarios.

摘要

提示工程在提升大语言模型(LLM)跨任务性能方面已展现出显著成效。然而现有提示优化方法大多仅关注任务级性能,忽视了查询偏好提示的重要性,导致性能未达最优。此外,这些方法严重依赖与LLM的频繁交互以获取优化反馈,产生了大量冗余交互成本。本文提出查询依赖型提示优化(QPO),通过多轮离线强化学习迭代微调小型预训练语言模型,使其生成适配输入查询的优化提示,从而显著提升目标大模型的提示效果。我们从离线提示示范数据中获取优化依据——这些数据作为开源任务上多样化提示基准测试的副产品已大量存在,从而规避了在线交互开销。进一步地,我们在每轮迭代中利用模型生成的提示持续扩充离线数据集,因为经微调模型产生的提示理应优于原始数据集中的源提示。这种迭代循环推动模型逐步生成最优提示。在不同规模LLM及多样化NLP与数学任务上的实验表明,本方法在零样本和少样本场景下均具有高效性与成本优势。


Reflection-Bench: Evaluating Epistemic Agency in Large Language Models

Abstract

arXiv:2410.16270v2 Announce Type: replace Abstract: With large language models (LLMs) increasingly deployed as cognitive engines for AI agents, the reliability and effectiveness critically hinge on their intrinsic epistemic agency, which remains understudied. Epistemic agency, the ability to flexibly construct, adapt, and monitor beliefs about dynamic environments, represents a base-model-level capacity independent of specific tools, modules, or applications. We characterize the holistic process underlying epistemic agency, which unfolds in seven interrelated dimensions: prediction, decision-making, perception, memory, counterfactual thinking, belief updating, and meta-reflection. Correspondingly, we propose Reflection-Bench, a cognitive-psychology-inspired benchmark consisting of seven tasks with long-term relevance and minimization of data leakage. Through a comprehensive evaluation of 16 models using three prompting strategies, we identify a clear three-tier performance hierarchy and significant limitations of current LLMs, particularly in meta-reflection capabilities. While state-of-the-art LLMs demonstrate rudimentary signs of epistemic agency, our findings suggest several promising research directions, including enhancing core cognitive functions, improving cross-functional coordination, and developing adaptive processing mechanisms. Our code and data are available at https://github.com/AI45Lab/ReflectionBench.

摘要

随着大型语言模型(LLMs)越来越多地被部署为人工智能代理的认知引擎,其可靠性和有效性关键取决于其内在的认知能动性,而这一领域尚未得到充分研究。认知能动性是指能够灵活构建、调整并监控对动态环境的信念的基础模型级能力,独立于特定工具、模块或应用。我们系统刻画了认知能动性的整体运作过程,该过程体现在七个相互关联的维度:预测、决策、感知、记忆、反事实思考、信念更新和元反思。据此,我们提出Reflection-Bench——一个受认知心理学启发的基准测试,包含七个具有长期相关性且数据泄露最小化的任务。通过对16个模型采用三种提示策略的全面评估,我们发现了明显的三级性能分层,并揭示出当前LLMs(尤其在元反思能力方面)存在显著局限。虽然最先进的LLMs已展现出认知能动性的初步迹象,但我们的研究结果指出了若干有前景的研究方向,包括增强核心认知功能、改善跨功能协调能力,以及开发自适应处理机制。代码与数据详见https://github.com/AI45Lab/ReflectionBench。


AMXFP4: Taming Activation Outliers with Asymmetric Microscaling Floating-Point for 4-bit LLM Inference

Abstract

arXiv:2411.09909v2 Announce Type: replace Abstract: As large language models (LLMs) grow in parameter size and context length, computation precision has been reduced from 16-bit to 4-bit to improve inference efficiency. However, this reduction causes accuracy degradation due to activation outliers. Rotation-based INT4 methods address this via matrix calibration, but they introduce multi-hour overheads and leave key computations in full precision. Microscaling (MX) floating-point (FP) formats offer fine-grained representation with a shared scale, enabling fully quantized matrix multiplications through direct casting without calibration. However, existing research shows unsatisfactory empirical results for MXFP4 inference, and the robustness of MX formats remains largely unexplored. In this work, we uncover the fundamental tradeoffs of the MX format: while it effectively suppresses activation outliers, it does so at the cost of increased group-wise asymmetry. To address this, we propose AMXFP4, a 4-bit asymmetric FP format that handles both issues using asymmetric shared scales, without requiring calibration. Our custom MAC engine adds negligible hardware cost while improving accuracy: AMXFP4 outperforms MXFP4 by 3% on VQA and exceeds rotation-based methods by 1.6% on CSQA. It also surpasses recently deployed commercial MXFP4 variants. Code: https://github.com/aiha-lab/MX-QLLM

摘要

随着大型语言模型(LLM)参数量与上下文长度的增长,计算精度已从16位降至4位以提升推理效率。然而,这种降低会因激活值异常点导致精度下降。基于旋转的INT4方法通过矩阵校准解决该问题,但会引入数小时开销且关键计算仍保留全精度。微缩放(MX)浮点(FP)格式通过共享尺度提供细粒度表征,无需校准即可直接转换实现全量化矩阵乘法。但现有研究表明MXFP4推理的实证结果不尽如人意,且MX格式的鲁棒性仍待探索。本文揭示了MX格式的基本权衡:虽能有效抑制激活值异常点,却以增加分组非对称性为代价。为此,我们提出AMXFP4——一种采用非对称共享尺度处理双重问题的4位非对称FP格式,无需校准。定制化MAC引擎在几乎不增加硬件成本的同时提升精度:AMXFP4在VQA任务上优于MXFP4达3%,在CSQA任务上较基于旋转的方法提高1.6%,并超越近期部署的商业MXFP4变体。代码见:https://github.com/aiha-lab/MX-QLLM
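
The tension between a shared scale and group-wise asymmetry can be seen in a small simulation. This is a rough sketch under stated assumptions, not the AMXFP4 format itself: it uses the E2M1 magnitudes of a 4-bit float, a real-valued shared scale per group (actual MX formats use a power-of-two scale over fixed-size blocks), and it models "asymmetric shared scales" as one scale for positive values and one for negative values. All function names are hypothetical.

```python
# Representable magnitudes of an E2M1 4-bit float.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def round_to_grid(magnitude):
    """Snap a non-negative value to the nearest FP4 magnitude."""
    return min(FP4_GRID, key=lambda g: abs(g - magnitude))

def shared_scale(magnitudes):
    """Map the largest magnitude in the group onto the top of the grid."""
    peak = max(magnitudes, default=0.0)
    return peak / FP4_GRID[-1] if peak > 0 else 1.0

def quantize_symmetric(group):
    """One shared scale for the whole group."""
    scale = shared_scale([abs(x) for x in group])
    return [
        (1 if x >= 0 else -1) * round_to_grid(abs(x) / scale) * scale
        for x in group
    ]

def quantize_asymmetric(group):
    """Separate shared scales for positive and negative values (assumption)."""
    pos_scale = shared_scale([x for x in group if x > 0])
    neg_scale = shared_scale([-x for x in group if x < 0])
    out = []
    for x in group:
        scale = pos_scale if x >= 0 else neg_scale
        sign = 1 if x >= 0 else -1
        out.append(sign * round_to_grid(abs(x) / scale) * scale)
    return out

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# A skewed group: large positive values, small negative values.
group = [6.0, 3.0, 1.0, -0.3, -0.2, -0.1]
err_sym = mse(group, quantize_symmetric(group))
err_asym = mse(group, quantize_asymmetric(group))
```

For this skewed group the single shared scale is dominated by the large positive values, so the small negative values land on coarse grid points; giving the negative side its own scale lowers the error.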


ChinaTravel: An Open-Ended Benchmark for Language Agents in Chinese Travel Planning

Abstract

arXiv:2412.13682v3 Announce Type: replace Abstract: Recent advances in LLMs, particularly in language reasoning and tool integration, have rapidly spurred the development of Language Agents for real-world use. Among these, travel planning represents a prominent domain, combining complex multi-objective planning challenges with practical deployment demands. However, existing benchmarks often oversimplify real-world requirements by focusing on synthetic queries and limited constraints. We address the gap of evaluating language agents in multi-day, multi-POI travel planning scenarios with diverse and open human needs. Specifically, we introduce ChinaTravel, the first open-ended benchmark grounded in authentic Chinese travel requirements collected from 1,154 human participants. We design a compositionally generalizable domain-specific language (DSL) for scalable evaluation, covering feasibility, constraint satisfaction, and preference comparison. Empirical studies reveal the potential of neuro-symbolic agents in travel planning, achieving a 37.0% constraint satisfaction rate on human queries, a 10× improvement over purely neural models. These findings highlight ChinaTravel as a pivotal milestone for advancing language agents in complex, real-world planning scenarios.

摘要

大语言模型(LLMs)在语言推理和工具集成方面的最新进展,迅速推动了面向现实世界开发的语言智能体发展。其中,旅行规划作为典型领域,兼具复杂的多目标规划挑战与实际部署需求。然而,现有基准测试往往过度简化现实要求,仅关注合成查询和有限约束条件。本研究针对多日、多POI旅行规划场景中多样化且开放的人类需求,填补了语言智能体评估的空白。具体而言,我们提出了基于1,154名人类参与者真实中国旅行需求的首个开放式基准测试ChinaTravel,并设计了一种具备组合泛化能力的领域特定语言(DSL)以实现可扩展评估,涵盖可行性、约束满足和偏好比较等维度。实证研究表明神经符号智能体在旅行规划中具有潜力,对人类查询的约束满足率达到37.0%,较纯神经模型提升10倍。这些发现标志着ChinaTravel成为推动语言智能体应对复杂现实规划场景的关键里程碑。


ReSo: A Reward-driven Self-organizing LLM-based Multi-Agent System for Reasoning Tasks

Abstract

arXiv:2503.02390v3 Announce Type: replace Abstract: Multi-agent systems (MAS) have emerged as a promising approach for enhancing the reasoning capabilities of large language models in complex problem-solving; however, current MAS frameworks suffer from poor flexibility and scalability with underdeveloped optimization strategies. To address these challenges, we propose ReSo, which integrates task graph generation with a reward-driven two-stage agent selection process centered on our Collaborative Reward Model that provides fine-grained reward signals to optimize MAS cooperation. We also introduce an automated data synthesis framework for generating MAS benchmarks without any human annotations. Experimental results show that ReSo matches or outperforms existing methods, achieving 33.7 percent accuracy on Math-MAS and 32.3 percent accuracy on SciBench-MAS, where other approaches completely fail.

摘要

多智能体系统(MAS)作为一种增强大语言模型在复杂问题求解中推理能力的方法已展现出巨大潜力,然而现有MAS框架存在灵活性差、可扩展性不足以及优化策略欠完善等问题。为解决这些挑战,我们提出ReSo框架,该框架将任务图生成与基于协作奖励模型的双阶段智能体选择机制相结合——该模型通过细粒度奖励信号优化MAS协作过程。我们还开发了无需人工标注的自动化数据合成框架用于生成MAS基准测试集。实验结果表明,ReSo在Math-MAS和SciBench-MAS测试集上分别达到33.7%和32.3%的准确率,与现有方法持平或更优,而其他方法在这些测试集上完全失效。


You need to MIMIC to get FAME: Solving Meeting Transcript Scarcity with a Multi-Agent Conversations

Abstract

arXiv:2502.13001v2 Announce Type: replace Abstract: Meeting summarization suffers from limited high-quality data, mainly due to privacy restrictions and expensive collection processes. We address this gap with FAME, a dataset of 500 meetings in English and 300 in German produced by MIMIC, our new multi-agent meeting synthesis framework that generates meeting transcripts on a given knowledge source by defining psychologically grounded participant profiles, outlining the conversation, and orchestrating a large language model (LLM) debate. A modular post-processing step refines these outputs, mitigating potential repetitiveness and overly formal tones, ensuring coherent, credible dialogues at scale. We also propose a psychologically grounded evaluation framework assessing naturalness, social behavior authenticity, and transcript difficulties. Human assessments show that FAME approximates real-meeting spontaneity (4.5/5 in naturalness), preserves speaker-centric challenges (3/5 in spoken language), and introduces richer information-oriented difficulty (4/5 in difficulty). These findings highlight that FAME is a good and scalable proxy for real-world meeting conditions. It enables new test scenarios for meeting summarization research and other conversation-centric applications in tasks requiring conversation data or simulating social scenarios under behavioral constraints.

摘要

会议摘要领域长期面临高质量数据匮乏的挑战,主要源于隐私限制与高昂的数据采集成本。本研究通过FAME数据集(包含500场英文会议与300场德文会议)填补这一空白,该数据集由我们提出的MIMIC多智能体会议合成框架生成。该框架基于给定知识源,通过定义心理学基础的参与者画像、规划对话流程并协调大语言模型(LLM)辩论来生成会议转录文本。模块化的后处理步骤优化输出结果,有效缓解重复性与过度正式化问题,确保大规模生成连贯可信的对话。我们还提出基于心理学的评估框架,从自然度、社交行为真实性和转录难度三个维度进行测评。人工评估显示FAME高度还原真实会议的自发性(自然度4.5/5),保留以发言者为核心的语言挑战(口语特征3/5),同时引入更丰富的信息导向型难度(难度4/5)。这些发现表明FAME能有效模拟真实会议场景,为会议摘要研究及其他对话中心型应用(包括需要对话数据或模拟行为约束下社交场景的任务)提供了可扩展的新型测试环境。


A Global Dataset Mapping the AI Innovation from Academic Research to Industrial Patents

Abstract

arXiv:2503.09257v5 Announce Type: replace Abstract: In the rapidly evolving field of artificial intelligence (AI), mapping innovation patterns and understanding effective technology transfer from research to applications are essential for economic growth. However, existing data infrastructures suffer from fragmentation, incomplete coverage, and insufficient evaluative capacity. Here, we present DeepInnovationAI, a comprehensive global dataset containing three structured files. DeepPatentAI.csv: Contains 2,356,204 patent records with 8 field-specific attributes. DeepDiveAI.csv: Encompasses 3,511,929 academic publications with 13 metadata fields. These two datasets leverage large language models, multilingual text analysis and dual-layer BERT classifiers to accurately identify AI-related content, while utilizing hypergraph analysis to create robust innovation metrics. Additionally, DeepCosineAI.csv: By applying semantic vector proximity analysis, this file contains 3,511,929 most relevant paper-patent pairs, each described by 3 metadata fields, to facilitate the identification of potential knowledge flows. DeepInnovationAI enables researchers, policymakers, and industry leaders to anticipate trends and identify collaboration opportunities. With extensive temporal and geographical scope, it supports detailed analysis of technological development patterns and international competition dynamics, establishing a foundation for modeling AI innovation and technology transfer processes.

摘要

在人工智能(AI)快速发展的领域中,描绘创新模式并理解从研究到应用的有效技术转移对经济增长至关重要。然而,现有数据基础设施存在碎片化、覆盖不全和评估能力不足等问题。本研究推出DeepInnovationAI——一个包含三个结构化文件的全球综合数据集:DeepPatentAI.csv包含2,356,204条专利记录,涵盖8个领域特定属性;DeepDiveAI.csv包含3,511,929篇学术出版物,涉及13个元数据字段。这两个数据集利用大语言模型、多语言文本分析和双层BERT分类器精准识别AI相关内容,同时通过超图分析构建稳健的创新指标。此外,DeepCosineAI.csv应用语义向量邻近度分析,包含3,511,929组最相关的论文-专利配对(每组含3个元数据字段),以促进潜在知识流的识别。DeepInnovationAI使研究人员、政策制定者和行业领袖能够预判趋势并发现合作机遇。凭借广泛的时间跨度和地理覆盖范围,该数据集支持对技术发展模式和国际竞争动态的细粒度分析,为建模AI创新与技术转移过程奠定基础。


AgentNet: Decentralized Evolutionary Coordination for LLM-based Multi-Agent Systems

Abstract

arXiv:2504.00587v2 Announce Type: replace Abstract: The rapid advancement of large language models (LLMs) has enabled the development of multi-agent systems where multiple LLM-based agents collaborate on complex tasks. However, existing systems often rely on centralized coordination, leading to scalability bottlenecks, reduced adaptability, and single points of failure. Privacy and proprietary knowledge concerns further hinder cross-organizational collaboration, resulting in siloed expertise. We propose AgentNet, a decentralized, Retrieval-Augmented Generation (RAG)-based framework that enables LLM-based agents to specialize, evolve, and collaborate autonomously in a dynamically structured Directed Acyclic Graph (DAG). Unlike prior approaches with static roles or centralized control, AgentNet allows agents to adjust connectivity and route tasks based on local expertise and context. AgentNet introduces three key innovations: (1) a fully decentralized coordination mechanism that eliminates the need for a central orchestrator, enhancing robustness and emergent intelligence; (2) dynamic agent graph topology that adapts in real time to task demands, ensuring scalability and resilience; and (3) a retrieval-based memory system for agents that supports continual skill refinement and specialization. By minimizing centralized control and data exchange, AgentNet enables fault-tolerant, privacy-preserving collaboration across organizations. Experiments show that AgentNet achieves higher task accuracy than both single-agent and centralized multi-agent baselines.

摘要

大型语言模型(LLM)的快速发展推动了多智能体系统的进步,其中多个基于LLM的智能体能够协作完成复杂任务。然而,现有系统通常依赖集中式协调,导致可扩展性瓶颈、适应性降低以及单点故障问题。隐私与专有知识问题进一步阻碍了跨组织协作,形成信息孤岛。我们提出AgentNet——一个基于检索增强生成(RAG)的去中心化框架,使基于LLM的智能体能够在动态构建的有向无环图(DAG)中实现专业化、自主演进与协作。与先前采用静态角色或集中控制的方法不同,AgentNet允许智能体根据本地专业知识和上下文调整连接关系并路由任务。该框架包含三项关键创新:(1)完全去中心化的协调机制,无需中央协调器即可增强系统鲁棒性与涌现智能;(2)动态智能体图拓扑结构,可实时适应任务需求,确保可扩展性与弹性;(3)基于检索的智能体记忆系统,支持持续技能优化与专业化。通过最小化集中控制与数据交换,AgentNet实现了跨组织的容错且保护隐私的协作。实验表明,AgentNet在任务准确性上优于单智能体与集中式多智能体基线系统。


Principled Understanding of Generalization for Generative Transformer Models in Arithmetic Reasoning Tasks

Abstract

arXiv:2407.17963v2 Announce Type: replace-cross Abstract: Transformer-based models excel in various tasks, but their generalization capabilities, especially in arithmetic reasoning, remain incompletely understood. Arithmetic tasks provide a controlled framework to explore these capabilities, yet performance anomalies persist, such as inconsistent effectiveness in multiplication and erratic generalization in modular addition (e.g., modulo 100 vs. 101). This paper develops a unified theoretical framework for understanding the generalization behaviors of transformers in arithmetic tasks, focusing on length generalization. Through detailed analysis of addition, multiplication, and modular operations, we reveal that translation invariance in addition aligns with relative positional encoding for robust generalization, while base mismatch in modular operations disrupts this alignment. Experiments across GPT-family models validate our framework, confirming its ability to predict generalization behaviors. Our work highlights the importance of task structure and training data distribution for achieving data-efficient and structure-aware training, providing a systematic approach to understanding length generalization in transformers.

摘要

基于Transformer的模型在各类任务中表现卓越,但其泛化能力——尤其在算术推理领域——仍未得到充分理解。算术任务为探索这些能力提供了可控框架,但性能异常现象持续存在,例如乘法运算效果的不稳定性及模加法中泛化的不可预测性(如模100与101的差异)。本文构建了一个统一的理论框架来理解Transformer在算术任务中的泛化行为,重点关注长度泛化问题。通过对加法、乘法和模运算的细致分析,我们发现加法的平移不变性与相对位置编码相契合从而实现稳健泛化,而模运算中的基数失配则会破坏这种一致性。在GPT系列模型上的实验验证了本框架预测泛化行为的有效性。本研究揭示了任务结构与训练数据分布对于实现数据高效和结构感知训练的重要性,为理解Transformer的长度泛化机制提供了系统性方法。
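
The modulo 100 versus modulo 101 contrast has a simple arithmetic core that can be checked directly. The sketch below is an illustration of one plausible reading of "base mismatch", not the paper's framework: with decimal digits, a sum modulo 100 is fully determined by the last two digits of each operand, whereas a sum modulo 101 depends on every digit, so longer inputs change the answer. The helper names and the sample pairs are hypothetical.

```python
def last_digits(n, k=2):
    """Keep only the k least-significant decimal digits of n."""
    return n % 10**k

def determined_by_last_two_digits(modulus, samples):
    """Check whether (a + b) mod modulus survives truncating the operands."""
    for a, b in samples:
        full = (a + b) % modulus
        truncated = (last_digits(a) + last_digits(b)) % modulus
        if full != truncated:
            return False
    return True

# Operand pairs of different lengths.
samples = [(12345, 67890), (999, 1), (100, 101), (123456, 654321)]

aligned = determined_by_last_two_digits(100, samples)     # 100 = 10**2
mismatched = determined_by_last_two_digits(101, samples)  # not a power of 10
```

Under this reading, a model trained on short operands can reuse the same trailing-digit rule on longer ones for modulus 100, while no such length-independent rule exists for modulus 101.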


DemoShapley: Valuation of Demonstrations for In-Context Learning

Abstract

arXiv:2410.07523v2 Announce Type: replace-cross Abstract: Large language models (LLMs) using in-context learning (ICL) excel in many tasks without task-specific fine-tuning. However, demonstration selection and ordering greatly impact ICL effectiveness. To address this, we propose DemoShapley and Beta-DemoShapley, inspired by Data Shapley and Beta Shapley, to assess the influence of individual demonstrations. DemoShapley captures how each example influences performance in different contexts, unlike other influence-based methods that rely on a fixed number of demonstrations. Beta-DemoShapley further enhances this framework by incorporating the Beta distribution, allowing users to assign higher weights to smaller cardinalities, which aligns with ICL's prompt length and computational constraints. Our findings show that the proposed algorithms improve model performance by selecting quality demonstrations and enhance generalization to out-of-distribution tasks. They also identify noise-compromised data and promote fairness in LLMs, protecting model performance and ensuring robustness across various scenarios.

摘要

采用上下文学习(ICL)的大型语言模型(LLM)无需针对特定任务进行微调即可在多项任务中表现优异。然而,演示样本的选择与排序会显著影响ICL的效果。为此,我们受数据Shapley和Beta Shapley启发,提出DemoShapley和Beta-DemoShapley方法来评估单个演示样本的影响力。与依赖固定数量演示样本的其他基于影响力的方法不同,DemoShapley能捕捉每个示例在不同上下文场景中对模型性能的影响。Beta-DemoShapley通过引入Beta分布进一步优化该框架,允许用户为较小基数分配更高权重,从而契合ICL的提示长度与计算约束条件。实验表明,所提算法通过筛选优质演示样本提升了模型性能,并增强了对分布外任务的泛化能力。该方法还能识别噪声干扰数据,促进LLM的公平性,在保护模型性能的同时确保各类场景下的鲁棒性。
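
Shapley-style valuation of demonstrations can be sketched with the standard Monte Carlo permutation estimator used for Data Shapley. This is a minimal sketch, not the DemoShapley algorithm as published: the `utility` function stands in for "ICL performance of a prompt built from these demonstrations", and here it is a toy additive score so the example runs without a model. The demonstration names, the `QUALITY` table, and all parameters are hypothetical.

```python
import random

def monte_carlo_shapley(demos, utility, num_permutations=200, seed=0):
    """Estimate each demonstration's average marginal contribution."""
    rng = random.Random(seed)
    totals = {d: 0.0 for d in demos}
    for _ in range(num_permutations):
        order = list(demos)
        rng.shuffle(order)
        prompt = []
        previous = utility(prompt)
        for demo in order:
            prompt.append(demo)
            current = utility(prompt)
            # Marginal gain from adding this demo to the demos before it.
            totals[demo] += current - previous
            previous = current
    return {d: total / num_permutations for d, total in totals.items()}

# Toy stand-in for validation accuracy of a prompt (hypothetical numbers).
QUALITY = {"good_demo": 0.3, "ok_demo": 0.1, "noisy_demo": -0.2}

def toy_utility(prompt_demos):
    return 0.5 + sum(QUALITY[d] for d in prompt_demos)

scores = monte_carlo_shapley(list(QUALITY), toy_utility)
ranked = sorted(scores, key=scores.get, reverse=True)
```

In a real setting `utility` would call the LLM with the given demonstrations in the given order, and a negative score, as for `noisy_demo` here, would flag a demonstration worth dropping.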


VITA: Towards Open-Source Interactive Omni Multimodal LLM

Abstract

arXiv:2408.05211v3 Announce Type: replace-cross Abstract: The remarkable multimodal capabilities and interactive experience of GPT-4o underscore their necessity in practical applications, yet open-source models rarely excel in both areas. In this paper, we introduce VITA, the first open-source Multimodal Large Language Model (MLLM) adept at simultaneously processing and analyzing the Video, Image, Text, and Audio modalities while also offering an advanced multimodal interactive experience. Starting from Mixtral 8x7B as a language foundation, we expand its Chinese vocabulary and then apply bilingual instruction tuning. We further endow the language model with visual and audio capabilities through two-stage multi-task learning of multimodal alignment and instruction tuning. VITA demonstrates robust foundational capabilities in multilingual, vision, and audio understanding, as evidenced by its strong performance across a range of both unimodal and multimodal benchmarks. Beyond foundational capabilities, we have made considerable progress in enhancing the natural multimodal human-computer interaction experience. VITA is the first step for the open-source community to explore the seamless integration of multimodal understanding and interaction. While much work remains before VITA comes close to its closed-source counterparts, we hope that its role as a pioneer can serve as a cornerstone for subsequent research. Project Page: https://vita-home.github.io.

摘要

GPT-4o卓越的多模态能力和交互体验凸显了其在实际应用中的必要性,但开源模型很少能在这两方面同时表现出色。本文介绍了VITA,这是首个能够同时处理和分析视频、图像、文本及音频模态的开源多模态大语言模型(MLLM),同时具备先进的多模态交互体验。我们以Mixtral 8x7B作为语言基础,扩展其中文词汇库后进行了双语指令微调。随后通过多模态对齐与指令微调的两阶段多任务学习,赋予该语言模型视觉与听觉能力。VITA在多语言、视觉和听觉理解方面展现出强大的基础能力,其在一系列单模态和多模态基准测试中的优异表现即为明证。除基础能力外,我们在提升自然多模态人机交互体验方面也取得了显著进展。VITA是开源社区探索多模态理解与交互无缝整合的第一步。尽管VITA仍需大量工作以接近闭源模型的水平,我们希望其作为先驱者的角色能为后续研究奠定基石。项目页面:https://vita-home.github.io。


NativQA: Multilingual Culturally-Aligned Natural Query for LLMs

Abstract

arXiv:2407.09823v3 Announce Type: replace-cross Abstract: Natural Question Answering (QA) datasets play a crucial role in evaluating the capabilities of large language models (LLMs), ensuring their effectiveness in real-world applications. Despite the numerous QA datasets that have been developed, including some parallel efforts, there is a notable lack of both a framework and large-scale region-specific datasets built from queries by native users in their own languages. This gap hinders effective benchmarking and the development of models fine-tuned for regional and cultural specificities. In this study, we propose a scalable, language-independent framework, NativQA, to seamlessly construct culturally and regionally aligned QA datasets in native languages, for LLM evaluation and tuning. We demonstrate the efficacy of the proposed framework by designing a multilingual natural QA dataset, MultiNativQA, consisting of ~64k manually annotated QA pairs in seven languages, ranging from high to extremely low resource, based on queries from native speakers from 9 regions covering 18 topics. We benchmark open- and closed-source LLMs with the MultiNativQA dataset. We have made the MultiNativQA dataset (https://huggingface.co/datasets/QCRI/MultiNativQA) and other experimental scripts (https://gitlab.com/nativqa/multinativqa) publicly available for the community.

摘要

自然问答(QA)数据集在评估大语言模型(LLM)能力方面发挥着关键作用,确保其在现实应用中的有效性。尽管目前已开发出众多QA数据集并有部分并行研究工作,但显著缺乏由母语用户以自身语言查询的框架化、大规模区域特异性数据集。这一空白阻碍了针对区域与文化特性的有效基准测试和微调模型的发展。本研究提出一个可扩展、语言无关的框架NativQA,用于无缝构建与文化和区域对齐的母语QA数据集,以支持LLM评估与调优。我们通过设计多语言自然QA数据集MultiNativQA验证该框架的有效性,该数据集包含约6.4万条人工标注的QA对,涵盖7种从高资源到极低资源的语言,基于来自9个地区母语者提出的18个主题查询。我们使用MultiNativQA数据集对开源和闭源LLM进行基准测试。MultiNativQA数据集(https://huggingface.co/datasets/QCRI/MultiNativQA)及其他实验脚本(https://gitlab.com/nativqa/multinativqa)已向社区公开。


On the Vulnerability of Applying Retrieval-Augmented Generation within Knowledge-Intensive Application Domains

Abstract

arXiv:2409.17275v2 Announce Type: replace-cross Abstract: Retrieval-Augmented Generation (RAG) has been empirically shown to enhance the performance of large language models (LLMs) in knowledge-intensive domains such as healthcare, finance, and legal contexts. Given a query, RAG retrieves relevant documents from a corpus and integrates them into the LLMs' generation process. In this study, we investigate the adversarial robustness of RAG, focusing specifically on examining the retrieval system. First, across 225 different setup combinations of corpus, retriever, query, and targeted information, we show that retrieval systems are vulnerable to universal poisoning attacks in medical Q&A. In such attacks, adversaries generate poisoned documents containing a broad spectrum of targeted information, such as personally identifiable information. When these poisoned documents are inserted into a corpus, they can be accurately retrieved by any users, as long as attacker-specified queries are used. To understand this vulnerability, we discovered that the deviation from the query's embedding to that of the poisoned document tends to follow a pattern in which the high similarity between the poisoned document and the query is retained, thereby enabling precise retrieval. Based on these findings, we develop a new detection-based defense to ensure the safe use of RAG. Through extensive experiments spanning various Q&A domains, we observed that our proposed method consistently achieves excellent detection rates in nearly all cases.

摘要

检索增强生成(RAG)已被实证研究证明能够提升大语言模型(LLM)在医疗、金融和法律等知识密集型领域的表现。该方法通过从语料库中检索与查询相关的文档,并将其整合至LLM的生成过程中实现增强。本研究聚焦于RAG的对抗鲁棒性,重点考察其检索系统的安全性。首先,我们在医学问答领域构建了225种不同的语料库、检索器、查询及目标信息组合场景,证明检索系统普遍存在投毒攻击漏洞:攻击者可生成包含各类目标信息(如个人身份信息)的污染文档,当这些文档被植入语料库后,只要用户使用攻击者预设的查询语句,污染文档就能被精准检索。通过分析发现,该漏洞源于查询嵌入向量与污染文档嵌入向量的偏移遵循特定模式——污染文档与查询间的高相似度特征得以保留,从而确保精确检索。基于此发现,我们开发了一种基于检测的新型防御机制以保障RAG的安全使用。跨多个问答领域的实验表明,该检测方法在几乎所有案例中均能保持优异的检出率。
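
The retrieval step that the attack exploits can be sketched with a toy dense retriever. This is an illustrative sketch only: the three-dimensional vectors are hypothetical stand-ins for real embeddings, and the poisoned document is modeled as the attacker-specified query's embedding plus a small deviation, which is an assumption meant to mirror the "high similarity is retained" observation. It does not reproduce the paper's attack construction or its detection-based defense.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_vec, corpus, k=1):
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(
        corpus, key=lambda doc_id: cosine(query_vec, corpus[doc_id]), reverse=True
    )
    return ranked[:k]

# Hypothetical embeddings of two legitimate documents.
corpus = {
    "clean_a": [0.9, 0.1, 0.0],
    "clean_b": [0.1, 0.9, 0.2],
}

# Embedding of an attacker-specified query.
query = [0.7, 0.2, 0.1]

# Poisoned document: its embedding stays close to the query's even though
# it carries extra targeted information (modeled as a small deviation).
corpus["poisoned"] = [0.72, 0.19, 0.11]

top = retrieve(query, corpus, k=1)
```

Because ranking depends only on embedding similarity, any user who issues the attacker-specified query pulls the poisoned document to the top, regardless of what else it contains.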


Abstract

arXiv:2410.13460v2 Announce Type: replace-cross Abstract: Many court systems are overwhelmed all over the world, leading to huge backlogs of pending cases. Effective triage systems, like those in emergency rooms, could ensure proper prioritization of open cases, optimizing time and resource allocation in the court system. In this work, we introduce the Criticality Prediction dataset, a novel resource for evaluating case prioritization. Our dataset features a two-tier labeling system: (1) the binary LD-Label, identifying cases published as Leading Decisions (LD), and (2) the more granular Citation-Label, ranking cases by their citation frequency and recency, allowing for a more nuanced evaluation. Unlike existing approaches that rely on resource-intensive manual annotations, we algorithmically derive labels leading to a much larger dataset than otherwise possible. We evaluate several multilingual models, including both smaller fine-tuned models and large language models in a zero-shot setting. Our results show that the fine-tuned models consistently outperform their larger counterparts, thanks to our large training set. Our results highlight that for highly domain-specific tasks like ours, large training sets are still valuable.

摘要

全球许多法院系统不堪重负,导致待审案件大量积压。有效的案件分流机制(类似于急诊室的分诊系统)能够确保未决案件得到合理优先级排序,从而优化司法系统的时间和资源配置。本研究推出"关键性预测数据集",作为评估案件优先级的新型资源。该数据集采用双层标注体系:(1) 二元LD标签,用于识别被列为指导性案例(LD)的案件;(2) 更细粒度的引用标签,根据案件被引用频率和新近度进行分级,实现更精细的评估。与依赖资源密集型人工标注的现有方法不同,我们通过算法生成标签,从而构建出规模远超传统方法的数据集。我们评估了包括微调小型模型和零样本设置下的大语言模型在内的多种多语言模型。结果表明,得益于大规模训练集,经过微调的模型始终优于参数量更大的模型。本研究证实,对于此类高度专业化的领域任务,大规模训练集仍具有重要价值。
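
Algorithmic derivation of a citation-based label can be sketched as a recency-weighted citation count. The abstract does not give the scoring formula, so everything here is an assumption made for illustration: each citation is discounted by an exponential half-life, and cases are ranked by the resulting score. The function names, the `half_life` parameter, and the example data are hypothetical.

```python
def citation_score(citing_years, current_year, half_life=5.0):
    """Sum of citations, each discounted by how long ago it occurred."""
    return sum(
        0.5 ** ((current_year - year) / half_life) for year in citing_years
    )

def rank_cases(cases, current_year):
    """Order case ids from most to least critical under the toy score."""
    return sorted(
        cases,
        key=lambda case_id: citation_score(cases[case_id], current_year),
        reverse=True,
    )

# Hypothetical cases mapped to the years in which they were cited.
cases = {
    "case_a": [2001, 2002, 2003],  # more citations, but old
    "case_b": [2022, 2023],        # fewer citations, but recent
    "case_c": [],                  # never cited
}

ranking = rank_cases(cases, current_year=2024)
```

A score like this needs no manual annotation, which is the property the abstract credits for the dataset's size; the binary LD-Label would come from publication status instead.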


Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents

Abstract

arXiv:2410.02644v4 Announce Type: replace-cross Abstract: Although LLM-based agents, powered by Large Language Models (LLMs), can use external tools and memory mechanisms to solve complex real-world tasks, they may also introduce critical security vulnerabilities. However, the existing literature does not comprehensively evaluate attacks and defenses against LLM-based agents. To address this, we introduce Agent Security Bench (ASB), a comprehensive framework designed to formalize, benchmark, and evaluate the attacks and defenses of LLM-based agents, including 10 scenarios (e.g., e-commerce, autonomous driving, finance), 10 agents targeting the scenarios, over 400 tools, 27 different types of attack/defense methods, and 7 evaluation metrics. Based on ASB, we benchmark 10 prompt injection attacks, a memory poisoning attack, a novel Plan-of-Thought backdoor attack, 4 mixed attacks, and 11 corresponding defenses across 13 LLM backbones. Our benchmark results reveal critical vulnerabilities in different stages of agent operation, including system prompt, user prompt handling, tool usage, and memory retrieval, with the highest average attack success rate reaching 84.30%, while current defenses show limited effectiveness, revealing important work still to be done on agent security. We also introduce a new metric to evaluate the agents' capability to balance utility and security. Our code can be found at https://github.com/agiresearch/ASB.

摘要

尽管基于大型语言模型(LLM)的智能体能够借助外部工具和记忆机制解决复杂现实任务,但其也可能引入严重的安全漏洞。然而现有研究尚未系统评估针对此类智能体的攻击与防御方法。为此,我们提出Agent安全基准(ASB)——一个涵盖10个应用场景(如电子商务、自动驾驶、金融)、10个场景专用智能体、400余种工具、27类攻防方法及7项评估指标的综合性框架,用于规范化、基准测试和评估LLM智能体的安全性。基于ASB,我们在13个LLM主干模型上测试了10种提示注入攻击、1种记忆污染攻击、1种新颖的思维链后门攻击、4种混合攻击及11种对应防御措施。基准测试结果表明:智能体在系统提示处理、用户指令解析、工具调用及记忆检索等环节均存在严重漏洞(最高平均攻击成功率84.30%),而现有防御措施效果有限,这揭示了智能体安全领域亟待解决的关键问题。我们还提出新指标以评估智能体在功能效用与安全性间的平衡能力。代码已开源:https://github.com/agiresearch/ASB。


EvalYaks: Instruction Tuning Datasets and LoRA Fine-tuned Models for Automated Scoring of CEFR B2 Speaking Assessment Transcripts

Abstract

arXiv:2408.12226v2 Announce Type: replace-cross Abstract: Relying on human experts to evaluate CEFR speaking assessments in an e-learning environment creates scalability challenges, as it limits how quickly and widely assessments can be conducted. We aim to automate the evaluation of CEFR B2 English speaking assessments in e-learning environments from conversation transcripts. First, we evaluate the capability of leading open source and commercial Large Language Models (LLMs) to score a candidate's performance across various criteria in the CEFR B2 speaking exam in both global and India-specific contexts. Next, we create a new expert-validated, CEFR-aligned synthetic conversational dataset with transcripts that are rated at different assessment scores. In addition, new instruction-tuned datasets are developed from the English Vocabulary Profile (up to CEFR B2 level) and the CEFR-SP WikiAuto datasets. Finally, using these new datasets, we perform parameter efficient instruction tuning of Mistral Instruct 7B v0.2 to develop a family of models called EvalYaks. Four models in this family are for assessing the four sections of the CEFR B2 speaking exam, one for identifying the CEFR level of vocabulary and generating level-specific vocabulary, and another for detecting the CEFR level of text and generating level-specific text. EvalYaks achieved an average acceptable accuracy of 96%, a degree of variation of 0.35 levels, and performed 3 times better than the next best model. This demonstrates that a 7B parameter LLM instruction tuned with high-quality CEFR-aligned assessment data can effectively evaluate and score CEFR B2 English speaking assessments, offering a promising solution for scalable, automated language proficiency evaluation.

摘要

在电子学习环境中依赖人工专家评估CEFR口语测试会引发可扩展性挑战,因为这限制了评估实施的效率与范围。本研究旨在通过对话转录实现电子学习环境下CEFR B2级英语口语测试的自动化评估。首先,我们评估了主流开源与商业大语言模型(LLMs)在全球及印度特定语境下对CEFR B2口语考试各评分标准的打分能力。其次,我们创建了经专家验证、符合CEFR标准的新型合成对话数据集,该数据集包含具有不同评分等级的对话转录文本。此外,基于《英语词汇大纲》(至CEFR B2级)和CEFR-SP WikiAuto数据集开发了新的指令微调数据集。最终利用这些数据集对Mistral Instruct 7B v0.2进行参数高效指令微调,开发出名为EvalYaks的系列模型。该系列包含四个针对CEFR B2口语考试四大板块的评估模型,一个用于识别词汇CEFR等级并生成对应等级词汇的模型,以及一个用于检测文本CEFR等级并生成对应等级文本的模型。EvalYaks实现了96%的平均可接受准确率、0.35个等级的程度变异,其性能优于次优模型达3倍。这表明:采用高质量CEFR对齐评估数据进行指令微调的70亿参数LLM能有效评估CEFR B2英语口语测试,为可扩展的自动化语言能力评估提供了可行解决方案。


SVIP: Towards Verifiable Inference of Open-source Large Language Models

Abstract

arXiv:2410.22307v2 Announce Type: replace-cross Abstract: The ever-increasing size of open-source Large Language Models (LLMs) renders local deployment impractical for individual users. Decentralized computing has emerged as a cost-effective solution, allowing individuals and small companies to perform LLM inference for users using surplus computational power. However, a computing provider may stealthily substitute the requested LLM with a smaller, less capable model without consent from users, thereby benefiting from cost savings. We introduce SVIP, a secret-based verifiable LLM inference protocol. Unlike existing solutions based on cryptographic or game-theoretic techniques, our method is computationally effective and does not rest on strong assumptions. Our protocol requires the computing provider to return both the generated text and processed hidden representations from LLMs. We then train a proxy task on these representations, effectively transforming them into a unique model identifier. With our protocol, users can reliably verify whether the computing provider is acting honestly. A carefully integrated secret mechanism further strengthens its security. We thoroughly analyze our protocol under multiple strong and adaptive adversarial scenarios. Our extensive experiments demonstrate that SVIP is accurate, generalizable, computationally efficient, and resistant to various attacks. Notably, SVIP achieves false negative rates below 5% and false positive rates below 3%, while requiring less than 0.01 seconds per prompt query for verification.

摘要

随着开源大语言模型(LLMs)规模的持续扩大,本地部署对个人用户而言已不切实际。去中心化计算作为一种经济高效的解决方案应运而生,使个人和小型企业能够利用闲置算力为用户提供LLM推理服务。然而,计算服务提供商可能在未经用户同意的情况下,暗中将请求的大模型替换为能力较弱的较小模型,从而非法牟利。我们提出SVIP协议——一种基于密钥的可验证LLM推理方案。与现有基于密码学或博弈论的解决方案不同,我们的方法计算高效且无需依赖强假设。该协议要求计算服务提供商同时返回生成文本和LLM处理的隐含表征,我们通过这些表征训练代理任务,将其转化为唯一的模型标识符。借助该协议,用户可可靠验证计算服务提供商是否诚信履约。精心集成的密钥机制进一步增强了安全性。我们在多种强适应性对抗场景下对协议进行了全面分析。大量实验表明,SVIP具有准确性高、泛化性强、计算高效且能抵抗多种攻击的特点。特别值得注意的是,SVIP在每提示词验证耗时不足0.01秒的情况下,实现了低于5%的假阴性率和3%的假阳性率。


"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization

Abstract

arXiv:2411.02355v3 Announce Type: replace-cross Abstract: Quantization is a powerful tool for accelerating large language model (LLM) inference, but the accuracy-performance trade-offs across different formats remain unclear. In this paper, we conduct the most comprehensive empirical study to date, evaluating FP8, INT8, and INT4 quantization across academic benchmarks and real-world tasks on the entire Llama-3.1 model family. Through over 500,000 evaluations, our investigation yields several key findings: (1) FP8 (W8A8-FP) is effectively lossless across all model scales, (2) well-tuned INT8 (W8A8-INT) achieves surprisingly low (1-3%) accuracy degradation, and (3) INT4 weight-only (W4A16-INT) is more competitive than expected, rivaling 8-bit quantization. Further, we investigate the optimal quantization format for different deployments by analyzing inference performance through the popular vLLM framework. Our analysis provides clear deployment recommendations: W4A16 is the most cost-efficient for synchronous setups, while W8A8 dominates in asynchronous continuous batching. For mixed workloads, the optimal choice depends on the specific use case. Our findings offer practical, data-driven guidelines for deploying quantized LLMs at scale -- ensuring the best balance between speed, efficiency, and accuracy.

摘要

量化是加速大语言模型(LLM)推理的强大工具,但不同格式在精度与性能间的权衡仍不明确。本文开展了迄今为止最全面的实证研究,在Llama-3.1全系列模型上评估了FP8、INT8和INT4量化在学术基准和实际任务中的表现。通过超过50万次评估,我们获得以下关键发现:(1)FP8(W8A8-FP)在所有模型规模上基本无损;(2)调优良好的INT8(W8A8-INT)可实现惊人的低精度损失(1-3%);(3)INT4仅权重量化(W4A16-INT)比预期更具竞争力,性能接近8位量化。此外,我们通过主流vLLM框架分析推理性能,探究了不同部署场景下的最优量化格式。分析结果给出明确的部署建议:W4A16在同步设置中性价比最高,而W8A8在异步连续批处理中占优。对于混合工作负载,最优选择取决于具体用例。本研究为大规模部署量化LLM提供了基于数据的实用指南,确保在速度、效率和精度之间实现最佳平衡。
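
The gap between 8-bit and 4-bit integer formats discussed above can be illustrated with a minimal sketch of symmetric, per-tensor, round-to-nearest quantization. This is a generic textbook scheme chosen for illustration, not the tuned recipes evaluated in the paper; the function names and the sample weights are hypothetical.

```python
def quantize_symmetric(values, bits):
    """Symmetric per-tensor quantization to signed integers of width `bits`.

    Generic round-to-nearest scheme (an assumption for illustration),
    not the calibrated methods used in the study.
    """
    qmax = 2 ** (bits - 1) - 1              # 127 for INT8, 7 for INT4
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map integer codes back to floating-point values."""
    return [qi * scale for qi in q]


def max_abs_error(values, bits):
    """Largest round-trip error after quantizing and dequantizing."""
    q, scale = quantize_symmetric(values, bits)
    restored = dequantize(q, scale)
    return max(abs(a - b) for a, b in zip(values, restored))


# Hypothetical weight values, used only to compare the two bit widths.
weights = [0.12, -0.5, 0.33, 0.9, -0.07, 0.0]
err_int8 = max_abs_error(weights, 8)
err_int4 = max_abs_error(weights, 4)
print(f"INT8 max error: {err_int8:.5f}, INT4 max error: {err_int4:.5f}")
```

With round-to-nearest, the round-trip error is bounded by half the scale, so the coarser INT4 grid admits a larger worst-case error than INT8 on the same tensor. How much of that error shows up as accuracy loss is the empirical question the paper studies.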


One-Step is Enough: Sparse Autoencoders for Text-to-Image Diffusion Models

Abstract

arXiv:2410.22366v3 Announce Type: replace-cross Abstract: For large language models (LLMs), sparse autoencoders (SAEs) have been shown to decompose intermediate representations that often are not interpretable directly into sparse sums of interpretable features, facilitating better control and subsequent analysis. However, similar analyses and approaches have been lacking for text-to-image models. We investigate the possibility of using SAEs to learn interpretable features for SDXL Turbo, a few-step text-to-image diffusion model. To this end, we train SAEs on the updates performed by transformer blocks within SDXL Turbo's denoising U-net in its 1-step setting. Interestingly, we find that they generalize to 4-step SDXL Turbo and even to the multi-step SDXL base model (i.e., a different model) without additional training. In addition, we show that their learned features are interpretable, causally influence the generation process, and reveal specialization among the blocks. We do so by creating RIEBench, a representation-based image editing benchmark, for editing images while they are generated by turning on and off individual SAE features. This allows us to track which transformer blocks' features are the most impactful depending on the edit category. Our work is the first investigation of SAEs for interpretability in text-to-image diffusion models and our results establish SAEs as a promising approach for understanding and manipulating the internal mechanisms of text-to-image models.

摘要

对于大型语言模型(LLMs),稀疏自编码器(SAEs)已被证明能够将通常无法直接解释的中间表示分解为可解释特征的稀疏和,从而促进更好的控制和后续分析。然而,类似的分析方法在文本到图像模型中仍较为缺乏。本研究探讨了利用SAEs为SDXL Turbo(一种少步数文本到图像扩散模型)学习可解释特征的可行性。为此,我们在SDXL Turbo去噪U-net的单步设置中,对其变压器块执行的更新进行SAEs训练。有趣的是,我们发现这些SAEs无需额外训练即可泛化至4步SDXL Turbo,甚至适用于多步SDXL基础模型(即不同模型)。此外,我们证明所学习的特征具有可解释性,能够因果影响生成过程,并揭示各块之间的功能专化。为此,我们创建了RIEBench(基于表示的图像编辑基准),通过在生成过程中激活或关闭单个SAE特征来编辑图像。这使我们能够根据不同编辑类别追踪哪些变压器块的特征最具影响力。本研究是首次将SAEs应用于文本到图像扩散模型可解释性分析的探索,其结果确立了SAEs作为理解和操纵文本到图像模型内部机制的一种有前景的方法。


Towards Robust and Efficient Federated Low-Rank Adaptation with Heterogeneous Clients

Abstract

arXiv:2410.22815v2 Announce Type: replace-cross Abstract: Federated fine-tuning for Large Language Models (LLMs) faces significant challenges due to the heavy communication overhead of transmitting large model updates. Although Low Rank Adaptation (LoRA) has been proposed as a solution, its application in federated learning is complicated by discordance in aggregation. Existing methods addressing this discordance often suffer from performance degradation at low ranks in heterogeneous data settings. In response, we introduce LoRA-A² (Low Rank Adaptation with Alternating freeze and Adaptive rank selection), which demonstrates robustness in challenging settings with low ranks and high data heterogeneity. Our experimental findings reveal that LoRA-A² maintains performance even under extreme heterogeneity and low rank conditions, achieving a significant reduction in uploaded parameters compared to full fine-tuning without compromising performance. This adaptive mechanism increases robustness and communication efficiency in federated fine-tuning, enabling the practical deployment of LLMs in resource-constrained environments.

摘要

大型语言模型(LLM)的联邦微调因传输大规模模型更新带来的沉重通信开销而面临重大挑战。尽管低秩自适应(LoRA)被提出作为解决方案,但其在联邦学习中的应用因聚合不一致问题而复杂化。现有解决该不一致性的方法在异构数据设置下常面临低秩时性能下降的问题。为此,我们提出LoRA-A²(交替冻结与自适应秩选择的低秩自适应),该方法在低秩和高数据异质性的挑战性环境中展现出鲁棒性。实验结果表明,LoRA-A²即使在极端异质性和低秩条件下仍能保持性能,与全参数微调相比,在不损失性能的前提下实现了上传参数量的显著降低。这种自适应机制增强了联邦微调的鲁棒性和通信效率,使得LLM在资源受限环境中的实际部署成为可能。
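
One common reading of "discordance in aggregation" is that averaging the LoRA factors B and A separately does not give the average of the client updates B·A. The sketch below, with hypothetical rank-1 adapters for two clients, shows that mismatch and how it disappears when one factor is shared (frozen) across clients. That freezing removes the mismatch is a property of matrix products; that it corresponds to the "alternating freeze" in the method's name is an assumption here, not something the abstract spells out.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def mean_matrices(matrices):
    """Element-wise mean of equally shaped matrices."""
    n = len(matrices)
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [[sum(m[i][j] for m in matrices) / n for j in range(cols)]
            for i in range(rows)]


# Hypothetical rank-1 adapters for two clients: B is 2x1, A is 1x2.
b1, a1 = [[1.0], [0.0]], [[1.0, 0.0]]
b2, a2 = [[0.0], [1.0]], [[0.0, 1.0]]

# What the server would ideally obtain: the mean of the full updates B_i A_i.
ideal = mean_matrices([matmul(b1, a1), matmul(b2, a2)])
# What naive factor-wise averaging yields: mean(B) times mean(A).
naive = matmul(mean_matrices([b1, b2]), mean_matrices([a1, a2]))

# If every client shares the same (frozen) A, the two quantities coincide.
a_shared = [[1.0, 2.0]]
ideal_frozen = mean_matrices([matmul(b1, a_shared), matmul(b2, a_shared)])
naive_frozen = matmul(mean_matrices([b1, b2]), a_shared)

print(ideal, naive)                # differ
print(ideal_frozen, naive_frozen)  # identical
```

The example is deliberately tiny. It only shows why factor-wise averaging is lossy in general and exact when one factor is held fixed.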


LlamaDuo: LLMOps Pipeline for Seamless Migration from Service LLMs to Small-Scale Local LLMs

Abstract

arXiv:2408.13467v3 Announce Type: replace-cross Abstract: The widespread adoption of cloud-based proprietary large language models (LLMs) has introduced significant challenges, including operational dependencies, privacy concerns, and the necessity of continuous internet connectivity. In this work, we introduce an LLMOps pipeline, "LlamaDuo", for the seamless migration of knowledge and abilities from service-oriented LLMs to smaller, locally manageable models. This pipeline is crucial for ensuring service continuity in the presence of operational failures, strict privacy policies, or offline requirements. Our LlamaDuo involves fine-tuning a small language model against the service LLM using a synthetic dataset generated by the latter. If the performance of the fine-tuned model falls short of expectations, it is automatically improved through additional fine-tuning using extra similar data generated by the service LLM. This multi-turn process guarantees that the smaller model can eventually match or even surpass the service LLM's capabilities in specific downstream tasks, offering a practical and scalable solution for managing AI deployments in constrained environments. Extensive experiments with leading-edge LLMs are conducted to demonstrate the effectiveness, adaptability, and affordability of LlamaDuo across various downstream tasks. Our pipeline implementation is available at https://github.com/deep-diver/llamaduo.

摘要

基于云的专有大语言模型(LLMs)的广泛采用带来了重大挑战,包括运营依赖性、隐私问题以及持续互联网连接的必要性。本研究提出了一种名为"LlamaDuo"的LLMOps流程,用于将面向服务的大语言模型的知识与能力无缝迁移至更小、可本地管理的模型。该流程对于在出现运营故障、严格隐私政策或离线需求时确保服务连续性至关重要。我们的LlamaDuo通过使用服务LLM生成的合成数据集对小语言模型进行微调。若微调后模型的性能未达预期,系统会自动利用服务LLM生成的额外相似数据进行补充微调。这种多轮迭代过程确保小模型最终能在特定下游任务中匹配甚至超越服务LLM的能力,为受限环境中的AI部署提供了实用且可扩展的解决方案。我们通过对前沿LLMs的大量实验,验证了LlamaDuo在各种下游任务中的有效性、适应性和经济性。流程实现代码已发布于https://github.com/deep-diver/llamaduo。


ProReason: Multi-Modal Proactive Reasoning with Decoupled Eyesight and Wisdom

Abstract

arXiv:2410.14138v3 Announce Type: replace-cross Abstract: Large vision-language models (LVLMs) have witnessed significant progress on visual understanding tasks. However, they often prioritize language knowledge over image information on visual reasoning tasks, incurring performance degradation. To tackle this issue, we first identify the drawbacks of existing solutions (i.e., limited multi-modal reasoning capacities, and insufficient and irrelevant visual descriptions). We then decompose the visual reasoning process into two stages: proactive visual perception (i.e., eyesight) and textual reasoning (i.e., wisdom), and introduce a novel visual reasoning framework named ProReason. This framework features decoupled vision-reasoning capabilities and multi-run proactive perception. Briefly, given a multi-modal question, ProReason iterates proactive information collection and reasoning until the answer can be concluded with necessary and sufficient visual descriptions. Notably, the disassociation of capabilities allows seamless integration of existing large language models (LLMs) to compensate for the reasoning deficits of LVLMs. Our extensive experiments demonstrate that ProReason outperforms existing multi-step reasoning frameworks on various benchmarks for both open-source and closed-source models, with the average performance gain reaching 13.2%. In addition, the integration of LLMs allows ProReason to produce high-quality visual reasoning data, which empowers ProReason-distilled models (i.e., ProReason-VL and ProReason-Q3) to achieve superior performance in downstream tasks. Our insights into existing solutions and the decoupled perspective for feasible integration of LLMs illuminate future research on visual reasoning techniques, especially LLM-assisted ones.

摘要

大型视觉语言模型(LVLM)在视觉理解任务上取得了显著进展。然而,在视觉推理任务中,它们往往优先考虑语言知识而非图像信息,导致性能下降。为解决这一问题,我们首先分析了现有解决方案的缺陷(即多模态推理能力有限、视觉描述不足且不相关)。随后将视觉推理过程分解为两个阶段:主动视觉感知(即视力)与文本推理(即智慧),并提出名为ProReason的新型视觉推理框架。该框架具有解耦的视觉-推理能力和多轮主动感知特性。简言之,给定多模态问题时,ProReason会迭代执行主动信息收集与推理,直至获得必要且充分的视觉描述以得出结论。值得注意的是,这种能力解耦设计可无缝集成现有大语言模型(LLM),弥补LVLM的推理缺陷。大量实验表明,ProReason在开源和闭源模型的各种基准测试中均优于现有多步推理框架,平均性能提升达13.2%。此外,通过集成LLM,ProReason能生成高质量视觉推理数据,使得经ProReason蒸馏的模型(即ProReason-VL和ProReason-Q3)在下游任务中表现优异。我们对现有解决方案的洞察以及LLM可行集成的解耦视角,为未来视觉推理技术(特别是LLM辅助方向)的研究提供了重要启示。


Star Attention: Efficient LLM Inference over Long Sequences

Abstract

arXiv:2411.17116v3 Announce Type: replace-cross Abstract: Inference with Transformer-based Large Language Models (LLMs) on long sequences is both costly and slow due to the quadratic complexity of the self-attention mechanism. We introduce Star Attention, a two-phase block-sparse approximation that improves computational efficiency by sharding attention across multiple hosts while minimizing communication overhead. In the first phase, the context is processed using blockwise-local attention across hosts, in parallel. In the second phase, query and response tokens attend to all prior cached tokens through sequence-global attention. Star Attention integrates seamlessly with most Transformer-based LLMs trained with global attention, reducing memory requirements and inference time by up to 11x while preserving 97-100% of accuracy.

摘要

基于Transformer的大语言模型(LLM)在长序列推理中由于自注意力机制的二次复杂度,存在计算成本高且速度慢的问题。本文提出星型注意力机制,这是一种两阶段的块稀疏近似方法,通过将注意力分片到多个主机上并最小化通信开销,显著提升了计算效率。第一阶段采用跨主机的块局部并行注意力处理上下文;第二阶段则通过序列全局注意力使查询和响应标记关注所有先前缓存的标记。星型注意力可无缝集成至大多数基于全局注意力训练的Transformer类LLM,在保持97-100%准确率的同时,将内存需求和推理时间降低高达11倍。
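
The second phase described above needs a way to combine attention computed separately over each host's cached block into one sequence-global result. A standard way to do this exactly is to have each block return its local output together with the log-sum-exp of its scores, and then reweight. The sketch below shows only that merging step; the function names are hypothetical, and treating this as the aggregation used by Star Attention is an assumption, since the abstract does not give the details.

```python
import math


def block_attention(query, keys, values):
    """Softmax attention of one query over one block.

    Returns the block-local output and the log-sum-exp of the scores,
    which is all that is needed to merge blocks later.
    """
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / denom
           for d in range(dim)]
    return out, m + math.log(denom)


def merge_blocks(partials):
    """Combine per-block (output, log-sum-exp) pairs into global attention."""
    lses = [lse for _, lse in partials]
    m = max(lses)
    coeffs = [math.exp(lse - m) for lse in lses]
    total = sum(coeffs)
    dim = len(partials[0][0])
    return [sum(c * out[d] for c, (out, _) in zip(coeffs, partials)) / total
            for d in range(dim)]


# Hypothetical keys/values split across two "hosts".
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
query = [0.3, -0.2]

full, _ = block_attention(query, keys, values)
merged = merge_blocks([
    block_attention(query, keys[:2], values[:2]),
    block_attention(query, keys[2:], values[2:]),
])
print(full, merged)  # agree up to floating-point rounding
```

Each host only has to send one output vector and one scalar per query, which is consistent with the abstract's emphasis on minimizing communication overhead.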


HEIE: MLLM-Based Hierarchical Explainable AIGC Image Implausibility Evaluator

Abstract

arXiv:2411.17261v2 Announce Type: replace-cross Abstract: AIGC images are prevalent across various fields, yet they frequently suffer from quality issues like artifacts and unnatural textures. Specialized models aim to predict defect region heatmaps but face two primary challenges: (1) lack of explainability, failing to provide reasons and analyses for subtle defects, and (2) inability to leverage common sense and logical reasoning, leading to poor generalization. Multimodal large language models (MLLMs) promise better comprehension and reasoning but face their own challenges: (1) difficulty in fine-grained defect localization due to the limitations in capturing tiny details, and (2) constraints in providing pixel-wise outputs necessary for precise heatmap generation. To address these challenges, we propose HEIE: a novel MLLM-Based Hierarchical Explainable Image Implausibility Evaluator. We introduce the CoT-Driven Explainable Trinity Evaluator, which integrates heatmaps, scores, and explanation outputs, using CoT to decompose complex tasks into subtasks of increasing difficulty and enhance interpretability. Our Adaptive Hierarchical Implausibility Mapper synergizes low-level image features with high-level mapper tokens from LLMs, enabling precise local-to-global hierarchical heatmap predictions through an uncertainty-based adaptive token approach. Moreover, we propose a new dataset: Expl-AIGI-Eval, designed to facilitate interpretable implausibility evaluation of AIGC images. Our method demonstrates state-of-the-art performance through extensive experiments. Our project is at https://yfthu.github.io/HEIE/.

摘要

AIGC图像在各领域广泛应用,但普遍存在伪影和非自然纹理等质量问题。现有专业模型虽能预测缺陷区域热力图,却面临两大挑战:(1)缺乏可解释性,无法对细微缺陷提供原因分析;(2)难以运用常识逻辑推理,导致泛化能力不足。多模态大语言模型(MLLMs)虽具备更强理解推理能力,但存在:(1)受限于微小细节捕捉能力,难以实现细粒度缺陷定位;(2)无法输出像素级结果以生成精确热力图。为此,我们提出HEIE框架——基于MLLM的分层可解释图像不合理性评估器。通过创新性设计:1)思维链驱动的可解释三元评估器,整合热力图、评分与解释输出,利用思维链将复杂任务分解为递进难度的子任务以增强可解释性;2)自适应分层不合理性映射器,将底层图像特征与LLM高层映射标记协同融合,通过基于不确定性的自适应标记方法实现局部到全局的精确分层热力图预测。此外,我们构建了Expl-AIGI-Eval数据集以支持AIGC图像可解释性评估。大量实验表明本方法达到最先进性能。项目详见https://yfthu.github.io/HEIE/。


Enhancing LLM-based Hatred and Toxicity Detection with Meta-Toxic Knowledge Graph

Abstract

arXiv:2412.15268v3 Announce Type: replace-cross Abstract: The rapid growth of social media platforms has raised significant concerns regarding online content toxicity. When Large Language Models (LLMs) are used for toxicity detection, two key challenges emerge: 1) the absence of domain-specific toxic knowledge leads to false negatives; 2) the excessive sensitivity of LLMs to toxic speech results in false positives, limiting freedom of speech. To address these issues, we propose a novel method called MetaTox, leveraging graph search on a meta-toxic knowledge graph to enhance hatred and toxicity detection. First, we construct a comprehensive meta-toxic knowledge graph by utilizing LLMs to extract toxic information through a three-step pipeline, with toxic benchmark datasets serving as corpora. Second, we query the graph via retrieval and ranking processes to supplement accurate, relevant toxic knowledge. Extensive experiments and in-depth case studies across multiple datasets demonstrate that our MetaTox significantly decreases the false positive rate while boosting overall toxicity detection performance. Our code is available at https://github.com/YiboZhao624/MetaTox.

摘要

社交媒体的快速发展引发了人们对网络内容毒性的高度关注。当大型语言模型(LLMs)用于毒性检测时,面临两大关键挑战:1)缺乏领域特异性毒性知识导致假阴性;2)LLMs对毒性言论过度敏感引发假阳性,限制言论自由。为解决这些问题,我们提出名为MetaTox的新方法,通过元毒性知识图谱的图搜索来增强仇恨和毒性检测。首先,我们构建了全面的元毒性知识图谱,利用LLMs通过三步流程从毒性基准数据集中提取毒性信息作为语料库。其次,通过检索和排序流程查询该图谱,以补充准确相关的毒性知识。跨多个数据集的广泛实验和深入案例研究表明,我们的MetaTox在显著降低假阳性率的同时,整体毒性检测性能得到提升。代码已开源:https://github.com/YiboZhao624/MetaTox。


DEMO: Reframing Dialogue Interaction with Fine-grained Element Modeling

Abstract

arXiv:2412.04905v4 Announce Type: replace-cross Abstract: Dialogue systems enabled by large language models (LLMs) have become one of the central modes of human-machine interaction, bringing about vast amounts of conversation logs and increasing demand for dialogue generation. The dialogue's life-cycle spans from Prelude through Interlocution to Epilogue, encompassing rich dialogue elements. Despite large volumes of dialogue-related studies, there is a lack of systematic investigation into the dialogue stages to frame benchmark construction that covers comprehensive dialogue elements. This hinders the precise modeling, generation and assessment of LLM-based dialogue systems. To bridge this gap, in this paper, we introduce a new research task, Dialogue Element MOdeling, which includes Element Awareness and Dialogue Agent Interaction, and propose a novel benchmark, DEMO, designed for comprehensive dialogue modeling and assessment. On this basis, we further build the DEMO agent with the adept ability to model dialogue elements via imitation learning. Extensive experiments on DEMO indicate that current representative LLMs still have considerable potential for enhancement, and our DEMO agent performs well in both dialogue element modeling and out-of-domain tasks.

摘要

大型语言模型(LLMs)驱动的对话系统已成为人机交互的核心模式之一,其产生了海量对话日志并持续提升对话生成需求。对话生命周期涵盖从序幕经对谈至尾声的全过程,包含丰富的对话要素。尽管存在大量对话相关研究,目前仍缺乏对对话阶段的系统性考察以构建覆盖完整对话要素的基准框架,这阻碍了基于LLMs的对话系统在精准建模、生成与评估方面的发展。为填补这一空白,本文提出对话要素建模(DEMO)新研究任务——包含要素感知与对话代理交互两个维度,并构建了面向全要素对话建模与评估的新型基准DEMO。在此基础上,我们进一步通过模仿学习开发了具备对话要素建模能力的DEMO智能体。在DEMO基准上的大量实验表明,当前代表性LLMs仍存在显著提升空间,而我们的DEMO智能体在对话要素建模和跨领域任务中均表现优异。


Enhancing Table Recognition with Vision LLMs: A Benchmark and Neighbor-Guided Toolchain Reasoner

Abstract

arXiv:2412.20662v3 Announce Type: replace-cross Abstract: Pre-trained foundation models have recently made significant progress in table-related tasks such as table understanding and reasoning. However, recognizing the structure and content of unstructured tables using Vision Large Language Models (VLLMs) remains under-explored. To bridge this gap, we propose a benchmark based on a hierarchical design philosophy to evaluate the recognition capabilities of VLLMs in training-free scenarios. Through in-depth evaluations, we find that low-quality image input is a significant bottleneck in the recognition process. Drawing inspiration from this, we propose the Neighbor-Guided Toolchain Reasoner (NGTR) framework, which is characterized by integrating diverse lightweight tools for visual operations aimed at mitigating issues with low-quality images. Specifically, we transfer a tool selection experience from a similar neighbor to the input and design a reflection module to supervise the tool invocation process. Extensive experiments on public datasets demonstrate that our approach significantly enhances the recognition capabilities of the vanilla VLLMs. We believe that the benchmark and framework could provide an alternative solution to table recognition.

摘要

预训练基础模型在表格理解与推理等相关任务中近期取得显著进展。然而,利用视觉大语言模型(VLLMs)识别非结构化表格的布局与内容仍存在研究空白。为填补这一差距,我们提出基于分层设计理念的基准测试,用于评估VLLMs在免训练场景下的识别能力。通过深入评估发现,低质量图像输入是识别过程中的关键瓶颈。受此启发,我们提出邻域引导工具链推理器(NGTR)框架,其特点在于集成多种轻量级视觉操作工具以缓解低质图像问题。具体而言,我们将相似邻域的工具体验迁移至输入样本,并设计反射模块监督工具调用流程。在公开数据集上的大量实验表明,该方法显著提升了原始VLLMs的识别能力。我们相信该基准与框架能为表格识别提供替代性解决方案。


Assessing Social Alignment: Do Personality-Prompted Large Language Models Behave Like Humans?

Abstract

arXiv:2412.16772v2 Announce Type: replace-cross Abstract: The ongoing revolution in language modeling has led to various novel applications, some of which rely on the emerging social abilities of large language models (LLMs). Already, many turn to the new cyber friends for advice during the pivotal moments of their lives and trust them with the deepest secrets, implying that accurate shaping of the LLM's personality is paramount. To this end, state-of-the-art approaches exploit a vast variety of training data, and prompt the model to adopt a particular personality. We ask (i) if personality-prompted models behave (i.e., make decisions when presented with a social situation) in line with the ascribed personality (ii) if their behavior can be finely controlled. We use classic psychological experiments, the Milgram experiment and the Ultimatum Game, as social interaction testbeds and apply personality prompting to open- and closed-source LLMs from 4 different vendors. Our experiments reveal failure modes of the prompt-based modulation of the models' behavior that are shared across all models tested and persist under prompt perturbations. These findings challenge the optimistic sentiment toward personality prompting generally held in the community.

摘要

语言模型领域的持续革新催生了多种新颖应用,其中部分应用依赖于大语言模型(LLM)新兴的社会化能力。当前已有许多人在人生关键时刻向这些数字伴侣寻求建议,并向其倾诉最隐秘的私事,这表明精确塑造LLM的人格特质至关重要。为此,最先进的方法利用海量训练数据,并通过提示引导模型适配特定人格。本研究探讨:(i)人格提示模型的行为(即在社交情境中做出决策时)是否符合预设人格特征;(ii)其行为能否被精细调控。我们采用米尔格拉姆实验和最后通牒博弈这两个经典心理学实验作为社交互动测试平台,对4家厂商的开源和闭源LLM实施人格提示。实验发现,所有测试模型均存在提示调制的共性失效模式,且该现象在提示扰动下持续存在。这些发现对学界普遍持有的人格提示乐观态度提出了挑战。


RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios

Abstract

arXiv:2412.08972v2 Announce Type: replace-cross Abstract: This paper introduces RuleArena, a novel and challenging benchmark designed to evaluate the ability of large language models (LLMs) to follow complex, real-world rules in reasoning. Covering three practical domains -- airline baggage fees, NBA transactions, and tax regulations -- RuleArena assesses LLMs' proficiency in handling intricate natural language instructions that demand long-context understanding, logical reasoning, and accurate mathematical computation. Two key attributes distinguish RuleArena from traditional rule-based reasoning benchmarks: (1) it extends beyond standard first-order logic representations, and (2) it is grounded in authentic, practical scenarios, providing insights into the suitability and reliability of LLMs for real-world applications. Our findings reveal several notable limitations in LLMs: (1) they struggle to identify and apply the appropriate rules, frequently becoming confused by similar but distinct regulations, (2) they cannot consistently perform accurate mathematical computations, even when they correctly identify the relevant rules, and (3) in general, they perform poorly in the benchmark. We also observe a significant performance boost when LLMs are provided with external tools for oracle math and logic operations. These results highlight significant challenges and promising research directions in advancing LLMs' rule-guided reasoning capabilities in real-life applications. Our codes and data are publicly available on https://github.com/skyriver-2000/RuleArena.

摘要

本文介绍了一种新颖且具有挑战性的基准测试RuleArena,旨在评估大型语言模型(LLMs)在推理过程中遵循复杂现实规则的能力。该基准涵盖三个实际领域——航空行李费用、NBA交易规则和税收法规,通过需要长文本理解、逻辑推理和精确数学计算的复杂自然语言指令,评估LLMs的处理能力。RuleArena区别于传统基于规则的推理基准的两个关键特征是:(1)其超越了标准的一阶逻辑表示;(2)基于真实实践场景,为LLMs在现实应用中的适用性和可靠性提供洞见。我们的研究发现LLMs存在若干显著局限:(1)难以识别并应用正确规则,经常混淆相似但不同的法规;(2)即使正确识别相关规则,也无法持续执行精确的数学计算;(3)总体在基准测试中表现欠佳。研究还发现,当为LLMs提供数学计算和逻辑运算的外部工具时,其性能会显著提升。这些结果凸显了提升LLMs在现实应用中规则引导推理能力所面临的重大挑战和潜在研究方向。我们的代码和数据已公开在https://github.com/skyriver-2000/RuleArena。


Analyzing limits for in-context learning

Abstract

arXiv:2502.03503v2 Announce Type: replace-cross Abstract: We examine limits of in-context learning (ICL) in transformer models trained from scratch, focusing on function approximation tasks as a controlled setting to uncover fundamental behaviors. While we show empirically that transformer models can generalize, approximating unseen classes of polynomial (nonlinear) functions, they cannot generalize beyond certain values. We provide both empirical and mathematical arguments explaining that these limitations stem from architectural components, namely layer normalization and the attention scoring function, softmax. Together, our findings reveal structural constraints on ICL that are often masked in more complex NLP tasks but that need to be understood to improve robustness and interpretability in transformer-based models.

摘要

我们研究了从头训练的Transformer模型中上下文学习(ICL)的局限性,重点关注函数逼近任务这一受控环境,以揭示其基本行为。实验表明,虽然Transformer模型能够通过泛化逼近未见过的多项式(非线性)函数类,但其泛化能力存在特定数值范围的限制。我们通过实证和数学论证指出,这些限制源于架构组件——特别是层归一化和注意力评分函数softmax。本研究揭示了ICL的结构性约束,这些约束在更复杂的自然语言处理任务中常被掩盖,但对于提升基于Transformer模型的鲁棒性和可解释性至关重要。


Autonomy-of-Experts Models

Abstract

arXiv:2501.13074v2 Announce Type: replace-cross Abstract: Mixture-of-Experts (MoE) models mostly use a router to assign tokens to specific expert modules, activating only partial parameters and often outperforming dense models. We argue that the separation between the router's decision-making and the experts' execution is a critical yet overlooked issue, leading to suboptimal expert selection and ineffective learning. To address this, we propose Autonomy-of-Experts (AoE), a novel MoE paradigm in which experts autonomously select themselves to process inputs. AoE is based on the insight that an expert is aware of its own capacity to effectively process a token, an awareness reflected in the scale of its internal activations. In AoE, routers are removed; instead, experts pre-compute internal activations for inputs and are ranked based on their activation norms. Only the top-ranking experts proceed with the forward pass, while the others abort. The overhead of pre-computing activations is reduced through a low-rank weight factorization. This self-evaluating-then-partner-comparing approach ensures improved expert selection and effective learning. We pre-train language models having 700M up to 4B parameters, demonstrating that AoE outperforms traditional MoE models with comparable efficiency.

摘要

混合专家(MoE)模型通常通过路由器将标记分配给特定专家模块,仅激活部分参数,其性能常优于密集模型。本文指出,路由器决策与专家执行之间的分离是一个关键但被忽视的问题,这会导致专家选择次优和学习效率低下。为解决该问题,我们提出自主专家(AoE)这一新型MoE范式,其中专家可自主选择处理输入。AoE基于以下洞见:专家能够感知自身有效处理标记的能力,这种感知反映在其内部激活的规模上。在AoE中,路由器被移除;取而代之的是专家预先计算输入的内部激活,并根据激活范数进行排序。仅排名靠前的专家继续执行前向传播,其余则中止计算。通过低秩权重分解,预计算激活的开销得以降低。这种"自评估-伙伴比较"的方法确保了更优的专家选择和高效学习。我们预训练了参数量从7亿到40亿不等的语言模型,证明AoE在效率相当的情况下优于传统MoE模型。
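
The router-free selection step described above (pre-compute activations, rank experts by activation norm, let only the top-ranked ones continue) can be sketched as follows. The expert weights, the token, and the function names are hypothetical, and the low-rank factorization the abstract mentions for reducing overhead is left out for brevity.

```python
import math


def matvec(weights, x):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def l2_norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(c * c for c in v))


def select_experts(first_layer_weights, token, top_k):
    """Rank experts by the L2 norm of their pre-computed activation.

    Returns the indices of the top_k experts and the activations they
    keep; the remaining experts are treated as aborted.
    """
    activations = [matvec(w, token) for w in first_layer_weights]
    norms = [l2_norm(a) for a in activations]
    ranked = sorted(range(len(norms)), key=lambda i: norms[i], reverse=True)
    chosen = ranked[:top_k]
    return chosen, {i: activations[i] for i in chosen}


# Three hypothetical experts, each represented by its first projection only.
experts = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.1, 0.0], [0.0, 0.1]],
    [[2.0, 0.0], [0.0, 0.0]],
]
token = [3.0, 4.0]
chosen, kept = select_experts(experts, token, top_k=2)
print(chosen, kept)
```

In this toy case the activation norms are 5.0, 0.5 and 6.0, so experts 2 and 0 proceed and expert 1 aborts. In a real model the kept activations would be reused for the rest of the forward pass rather than recomputed.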


A Training-Free Length Extrapolation Approach for LLMs: Greedy Attention Logit Interpolation (GALI)

Abstract

arXiv:2502.02659v2 Announce Type: replace-cross Abstract: Transformer-based Large Language Models (LLMs) struggle with inputs exceeding their training context window due to positional out-of-distribution (O.O.D.) issues that disrupt attention. Existing solutions, including fine-tuning and training-free methods, face challenges like inefficiency, redundant interpolation, logit outliers, or loss of local positional information. We propose Greedy Attention Logit Interpolation (GALI), a training-free method that improves length extrapolation by greedily reusing pretrained positional intervals and interpolating attention logit to eliminate outliers. GALI achieves stable and superior performance across a wide range of long-context tasks without requiring input-length-specific tuning. Our analysis further reveals that LLMs interpret positional intervals unevenly and that restricting interpolation to narrower ranges improves performance, even on short-context tasks. GALI represents a step toward more robust and generalizable long-text processing in LLMs. Our implementation of GALI, along with the experiments from our paper, is open-sourced at https://github.com/adlnlp/Gali.

摘要

基于Transformer架构的大语言模型(LLMs)在处理超出训练上下文窗口长度的输入时,会因位置分布外(O.O.D.)问题导致注意力机制失效。现有解决方案包括微调方法和免训练方法,但存在效率低下、冗余插值、逻辑异常值或局部位置信息丢失等挑战。我们提出贪婪注意力逻辑插值法(GALI),这种免训练方法通过贪婪复用预训练位置区间并对注意力逻辑值进行插值以消除异常值,从而改进长度外推性能。GALI在各类长上下文任务中均能实现稳定优异的性能,且无需针对输入长度进行专门调参。进一步分析表明,大语言模型对位置区间的解读具有非均匀性,限制插值范围可提升性能,即使在短上下文任务中亦然。GALI为大语言模型实现更鲁棒、更通用的长文本处理迈出了重要一步。我们在https://github.com/adlnlp/Gali开源了GALI的实现代码及论文实验数据。


A Statistical Framework for Ranking LLM-Based Chatbots

Abstract

arXiv:2412.18407v2 Announce Type: replace-cross Abstract: Large language models (LLMs) have transformed natural language processing, with frameworks like Chatbot Arena providing pioneering platforms for evaluating these models. By facilitating millions of pairwise comparisons based on human judgments, Chatbot Arena has become a cornerstone in LLM evaluation, offering rich datasets for ranking models in open-ended conversational tasks. Building upon this foundation, we propose a statistical framework that incorporates key advancements to address specific challenges in pairwise comparison analysis. First, we introduce a factored tie model that enhances the ability to handle ties -- an integral aspect of human-judged comparisons -- significantly improving the model's fit to observed data. Second, we extend the framework to model covariance between competitors, enabling deeper insights into performance relationships and facilitating intuitive groupings into performance tiers. Third, we resolve optimization challenges arising from parameter non-uniqueness by introducing novel constraints, ensuring stable and interpretable parameter estimation. Through rigorous evaluation and extensive experimentation, our framework demonstrates substantial improvements over existing methods in modeling pairwise comparison data. To support reproducibility and practical adoption, we release leaderbot, an open-source Python package implementing our models and analyses.

摘要

大型语言模型(LLMs)已彻底改变自然语言处理领域,其中Chatbot Arena等评估框架为模型性能测评提供了开创性平台。通过基于人类判断的数百万次两两比较,该平台已成为开放域对话任务中模型排序的重要基准,并为排名研究提供丰富数据集。在此基础之上,我们提出一个包含关键改进的统计框架以解决成对比较分析中的特定挑战。首先,我们引入因子化平局模型,显著提升对人工评判中固有平局现象的处理能力,从而大幅改善模型对观测数据的拟合度。其次,我们扩展框架以建模竞争者间的协方差关系,既可深入揭示性能关联,又能实现性能等级的直观分组。第三,我们通过引入新型约束条件解决参数非唯一性导致的优化难题,确保参数估计的稳定性和可解释性。经严格评估与大量实验验证,本框架在成对比较数据建模方面较现有方法展现出显著提升。为支持研究复现与实际应用,我们发布了实现全部模型与分析的开源Python工具包leaderbot。
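
To make the role of ties in pairwise comparison concrete, the sketch below uses the classical Davidson extension of the Bradley-Terry model, in which a single tie parameter controls how often two competitors draw. It stands in for illustration only: it is not the factored tie model proposed in the paper, and it does not use the leaderbot package's API. The function name and the sample scores are hypothetical.

```python
import math


def davidson_probs(score_i, score_j, tie_param):
    """Win/loss/tie probabilities under the Davidson tie model.

    Scores are log-strengths; tie_param >= 0 scales the tie probability.
    With tie_param = 0 this reduces to the plain Bradley-Terry model.
    """
    pi_i, pi_j = math.exp(score_i), math.exp(score_j)
    tie = tie_param * math.sqrt(pi_i * pi_j)
    total = pi_i + pi_j + tie
    return pi_i / total, pi_j / total, tie / total


# Two equally strong competitors with tie_param = 1: each outcome is equally likely.
p_win, p_loss, p_tie = davidson_probs(0.0, 0.0, 1.0)
print(p_win, p_loss, p_tie)

# A stronger competitor wins more often, and ties become less likely.
print(davidson_probs(1.0, 0.0, 1.0))
```

Fitting such a model means choosing the scores (and the tie parameter) that maximize the likelihood of the observed wins, losses and ties. Because adding a constant to every score leaves all probabilities unchanged, some constraint on the scores is needed, which is the kind of parameter non-uniqueness the abstract refers to.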


SparQLe: Speech Queries to Text Translation Through LLMs

Abstract

arXiv:2502.09284v3 Announce Type: replace-cross Abstract: With the growing influence of Large Language Models (LLMs), there is increasing interest in integrating speech representations with them to enable more seamless multi-modal processing and speech understanding. This study introduces a novel approach that combines self-supervised speech representations with instruction-tuned LLMs for speech-to-text translation. The proposed approach leverages a modality adapter to align extracted speech features with instruction-tuned LLMs using English speech data. Our experiments demonstrate that this method effectively preserves the semantic content of the input speech and serves as an effective bridge between self-supervised speech models and instruction-tuned LLMs, offering a promising approach for various speech understanding applications.

摘要

随着大型语言模型(LLMs)影响力的日益扩大,研究者对将语音表征与其整合以实现更无缝的多模态处理与语音理解产生了浓厚兴趣。本研究提出了一种创新方法,将自监督语音表征与指令调优的LLMs相结合,用于语音到文本的翻译任务。该方案通过模态适配器,利用英语语音数据将提取的语音特征与指令调优的LLMs对齐。实验结果表明,该方法能有效保留输入语音的语义内容,成为连接自监督语音模型与指令调优LLMs的高效桥梁,为各类语音理解应用提供了具有前景的解决方案。


Safety Reasoning with Guidelines

Abstract

arXiv:2502.04040v2 Announce Type: replace-cross Abstract: Training safe LLMs remains a critical challenge. The most widely used method, Refusal Training (RT), struggles to generalize against various Out-of-Distribution (OOD) jailbreaking attacks. Although various advanced methods have been proposed to address this issue, we instead question whether OOD attacks inherently surpass the capability of vanilla RT. Evaluations using Best-of-N (BoN) reveal significant safety improvements as N increases, indicating that models possess adequate latent safety knowledge but RT fails to consistently elicit it under OOD scenarios. Further domain adaptation analysis reveals that direct RT causes reliance on superficial shortcuts, resulting in non-generalizable representation mappings. Inspired by our findings, we propose training the model to perform safety reasoning for each query. Specifically, we synthesize reasoning supervision aligned with specified guidelines that reflect diverse perspectives on safety knowledge. This encourages the model to engage in deeper reasoning, explicitly eliciting and utilizing latent safety knowledge for each query. Extensive experiments show that our method significantly improves model generalization against OOD attacks.

摘要

训练安全的LLM仍然是一个关键挑战。最广泛使用的拒绝训练(RT)方法难以泛化应对各种分布外(OOD)越狱攻击。尽管已有多种先进方法试图解决该问题,但我们质疑OOD攻击是否本质上超越了基础RT的能力。通过最佳N采样(BoN)评估发现,随着N值增大模型安全性显著提升,这表明模型具备足够的潜在安全知识,但RT在OOD场景下无法稳定激发这些知识。进一步的领域适应分析表明,直接RT会导致模型依赖表面捷径,形成不可泛化的表征映射。基于这些发现,我们提出训练模型对每个查询进行安全推理。具体而言,我们合成符合特定安全准则的推理监督信号,这些准则体现了安全知识的多维视角。该方法促使模型进行深度推理,显式地激发并利用每个查询中的潜在安全知识。大量实验表明,我们的方法显著提升了模型对OOD攻击的泛化能力。
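The Best-of-N evaluation mentioned in the abstract can be illustrated with a minimal sketch. It assumes one common reading of BoN for safety: a prompt counts as handled safely if at least one of N sampled responses passes a safety judge. The names `sample_response`, `is_safe`, `make_toy_model`, and `toy_judge` are hypothetical stand-ins, not the authors' code.

```python
from typing import Callable, List


def best_of_n_safe_rate(
    prompts: List[str],
    sample_response: Callable[[str], str],
    is_safe: Callable[[str, str], bool],
    n: int,
) -> float:
    """Fraction of prompts for which at least one of n sampled responses is judged safe."""
    if n < 1:
        raise ValueError("n must be at least 1")
    safe_prompts = 0
    for prompt in prompts:
        # Sample up to n responses; stop as soon as one passes the
        # (hypothetical) safety judge.
        if any(is_safe(prompt, sample_response(prompt)) for _ in range(n)):
            safe_prompts += 1
    return safe_prompts / len(prompts) if prompts else 0.0


def make_toy_model(refuse_every: int) -> Callable[[str], str]:
    """Deterministic stand-in for an LLM: refuses on every `refuse_every`-th call."""
    calls = {"count": 0}

    def sample(prompt: str) -> str:
        calls["count"] += 1
        if calls["count"] % refuse_every == 0:
            return "I can't help with that."
        return "Sure, here is how..."

    return sample


def toy_judge(prompt: str, response: str) -> bool:
    """Hypothetical judge: treats a refusal as the safe outcome."""
    return response.startswith("I can't")


attack_prompts = ["attack A", "attack B", "attack C"]
rate_n1 = best_of_n_safe_rate(attack_prompts, make_toy_model(3), toy_judge, n=1)
rate_n3 = best_of_n_safe_rate(attack_prompts, make_toy_model(3), toy_judge, n=3)
```

In this toy setup the safe rate rises as N grows from 1 to 3. That mirrors the qualitative trend the abstract reports, but it says nothing about real models.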


MuSC: Improving Complex Instruction Following with Multi-granularity Self-Contrastive Training

Abstract

arXiv:2502.11541v3 Announce Type: replace-cross Abstract: Complex instruction-following with elaborate constraints is imperative for Large Language Models (LLMs). While existing methods have constructed data for complex instruction alignment, they all rely on a more advanced model, especially GPT-4, limiting their application. In this paper, we propose a Multi-granularity Self-Contrastive Training (MuSC) framework, to improve the complex instruction alignment without relying on a stronger model. Our method is conducted on both coarse and fine granularity. On coarse-granularity, we construct constraint-aware preference data based on instruction decomposition and recombination. On fine-granularity, we perform token-aware preference optimization with dynamic token-level supervision. Our method is evaluated on open-sourced models, and experiment results show our method achieves significant improvement on both complex and general instruction-following benchmarks, surpassing previous self-alignment methods.

摘要

具备复杂指令遵循能力且满足精细约束条件对于大语言模型(LLMs)至关重要。现有方法虽已构建了复杂指令对齐数据,但均依赖于更先进的模型(尤其是GPT-4),限制了其应用范围。本文提出多粒度自对比训练框架(MuSC),在不依赖更强模型的情况下提升复杂指令对齐能力。该方法在粗粒度和细粒度两个层面展开:在粗粒度层面,通过指令分解与重组构建约束感知的偏好数据;在细粒度层面,采用动态词元级监督进行词元感知偏好优化。我们在开源模型上评估了该方法,实验结果表明其在复杂指令和通用指令遵循基准测试中均取得显著提升,超越了现有自对齐方法。


RaaS: Reasoning-Aware Attention Sparsity for Efficient LLM Reasoning

Abstract

arXiv:2502.11147v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) have demonstrated strong capabilities across various domains, with recent advancements in challenging reasoning tasks such as mathematics and programming. However, solving reasoning tasks often requires an LLM to generate long sequences, incurring O(N) time and memory complexities per token, where N is the current sequence length. To reduce complexities, existing sparsity-based algorithms propose to retain Key-Value (KV) vectors, the intermediate representations of only the most critical tokens. However, these algorithms struggle with the "impossible trinity" of accuracy, time, and memory. For example, the state-of-the-art algorithm, Quest, achieves high accuracy with O(L) time but O(N) memory (L is the cache budget, L ≪ N). To address the "impossible trinity", in this paper, we identify a new attention pattern during the decode stage of reasoning tasks, where milestone tokens (analogous to lemmas in mathematical proofs) emerge, are utilized, and then become unimportant afterward. Based on this pattern, we propose a new algorithm RaaS that identifies milestone tokens and retains their KV vectors until they are no longer needed, achieving high accuracy with O(L) time and O(L) memory complexities.

摘要

大型语言模型(LLMs)已在多个领域展现出强大能力,尤其在数学和编程等复杂推理任务中取得显著进展。然而,解决推理任务通常需要模型生成长序列,导致每个标记的时间与内存复杂度高达O(N)(N为当前序列长度)。为降低复杂度,现有基于稀疏性的算法提出仅保留关键标记的中间表示——键值(KV)向量。但这些算法难以同时兼顾精度、时间和内存的"不可能三角":例如当前最先进的Quest算法虽能以O(L)时间(L为缓存预算且L ≪ N)实现高精度,却需O(N)内存。针对这一难题,本文发现推理任务解码阶段存在一种新型注意力模式:里程碑标记(类比数学证明中的引理)会先出现并被利用,随后失去重要性。基于此模式,我们提出RaaS算法,通过识别里程碑标记并保留其KV向量直至不再需要,最终以O(L)时间和O(L)内存复杂度实现高精度。
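The idea of keeping a token's KV vectors only while the token is still being used can be sketched as a fixed-budget cache with last-used timestamps. This is an illustrative assumption about how such a policy might look, not the algorithm as specified in the paper: the attention threshold, the eviction rule, and the class name `TimestampedKVCache` are all hypothetical, and only token ids are tracked rather than real KV tensors.

```python
from typing import Dict, List


class TimestampedKVCache:
    """Toy cache that keeps at most `budget` tokens (the L in O(L) memory)."""

    def __init__(self, budget: int, threshold: float) -> None:
        self.budget = budget
        self.threshold = threshold
        # Maps token id -> decode step at which the token last received
        # attention at or above the threshold.
        self.last_used: Dict[int, int] = {}

    def step(self, step_idx: int, new_token: int, attention: Dict[int, float]) -> None:
        # Refresh the timestamp of every cached token that is still being used.
        for tok, score in attention.items():
            if tok in self.last_used and score >= self.threshold:
                self.last_used[tok] = step_idx
        # The newly decoded token enters the cache with the current timestamp.
        self.last_used[new_token] = step_idx
        # Evict the stalest tokens until the cache fits the budget again.
        while len(self.last_used) > self.budget:
            stalest = min(self.last_used, key=self.last_used.get)
            del self.last_used[stalest]

    def cached_tokens(self) -> List[int]:
        return sorted(self.last_used)


# Token 0 plays the role of a milestone: heavily attended during steps 1-3,
# then ignored, so it eventually leaves the cache.
cache = TimestampedKVCache(budget=3, threshold=0.1)
trace = [
    (0, {}),
    (1, {0: 0.5}),
    (2, {0: 0.5, 1: 0.01}),
    (3, {0: 0.5, 2: 0.0}),
    (4, {3: 0.9}),
    (5, {4: 0.2}),
]
snapshots = []
for step_idx, (new_token, attention) in enumerate(trace):
    cache.step(step_idx, new_token, attention)
    snapshots.append(cache.cached_tokens())
```

The cache never holds more than `budget` entries, which is what bounds memory by L in this sketch. A real implementation would store KV tensors and avoid the linear scan in `min`.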


Repo2Run: Automated Building Executable Environment for Code Repository at Scale

Abstract

arXiv:2502.13681v3 Announce Type: replace-cross Abstract: Scaling up executable code data is significant for improving language models' software engineering capability. The intricate nature of the process makes it labor-intensive, time-consuming and expert-knowledge-dependent to build a large number of executable code repositories, limiting the scalability of existing work based on running tests. The primary bottleneck lies in the automated building of test environments for different repositories, which is an essential yet underexplored task. To mitigate the gap, we introduce Repo2Run, the first LLM-based agent aiming at automating the building of executable test environments for any repositories at scale. Specifically, given a code repository, Repo2Run iteratively builds the Docker image, runs unit tests based on the feedback of the building, and synthesizes the Dockerfile until the entire pipeline is executed successfully. The resulting Dockerfile can then be used to create Docker container environments for running code and tests. We created a benchmark containing 420 Python repositories with unit tests for evaluation. The results illustrate that Repo2Run achieves an 86.0% success rate, outperforming SWE-agent by 77.0%. The resources of Repo2Run are available at https://github.com/bytedance/Repo2Run.


LLM Agents Making Agent Tools

Abstract

arXiv:2502.11705v2 Announce Type: replace-cross Abstract: Tool use has turned large language models (LLMs) into powerful agents that can perform complex multi-step tasks by dynamically utilising external software components. However, these tools must be implemented in advance by human developers, hindering the applicability of LLM agents in domains demanding large numbers of highly specialised tools, like in life sciences and medicine. Motivated by the growing trend of scientific studies accompanied by public code repositories, we propose ToolMaker, an agentic framework that autonomously transforms papers with code into LLM-compatible tools. Given a GitHub URL and short task description, ToolMaker autonomously installs dependencies and generates code to perform the task, using a closed-loop self-correction mechanism for debugging. To evaluate our approach, we introduce a benchmark comprising 15 complex computational tasks spanning various domains with over 100 unit tests to assess correctness and robustness. Our method correctly implements 80% of the tasks, substantially outperforming current state-of-the-art software engineering agents. ToolMaker therefore is a step towards fully autonomous agent-based scientific workflows. Our code and benchmark are publicly available at https://github.com/KatherLab/ToolMaker.

摘要

工具使用使大语言模型(LLM)成为能够通过动态调用外部软件组件执行复杂多步任务的强大智能体。然而,这些工具必须由人类开发者预先实现,这限制了LLM智能体在需要大量高度专业化工具的领域(如生命科学和医学)的适用性。受科学研究伴随公开代码库趋势的启发,我们提出了ToolMaker——一种能够自主将附带代码的论文转化为LLM兼容工具的智能框架。给定GitHub URL和简短任务描述,ToolMaker可自主安装依赖项并生成执行任务的代码,同时采用闭环自校正机制进行调试。为评估该方法,我们构建了一个包含15个跨领域复杂计算任务的基准测试集,通过100余项单元测试验证正确性与鲁棒性。我们的方法成功实现了80%的任务,显著优于当前最先进的软件工程智能体。ToolMaker因此向基于智能体的全自主科学工作流迈出了重要一步。代码与基准测试集已公开于https://github.com/KatherLab/ToolMaker。


Flow-of-Options: Diversified and Improved LLM Reasoning by Thinking Through Options

Abstract

arXiv:2502.12929v2 Announce Type: replace-cross Abstract: We present a novel reasoning approach called Flow-of-Options (FoO), designed to address intrinsic biases in Large Language Models (LLMs). Flow-of-Options enables LLMs to systematically explore a diverse range of possibilities in their reasoning, as demonstrated by an FoO-based agentic framework developed for autonomously solving Machine Learning (ML) tasks. FoO enforces diversity in LLM solutions through compressed and interpretable task representations, resulting in improvements of 38.2% - 69.2% on standard data science tasks, and 37.4% - 47.9% on therapeutic chemistry tasks, as compared to state-of-the-art baselines. With an overall operation cost under $1 per task, our framework is well-suited for cost-sensitive applications. Going beyond tabular classification and regression, we show the broader applicability of our FoO-based agentic system to tasks such as reinforcement learning and image generation. Our code is open-sourced at: https://github.com/flagshippioneering/Flow-of-Options.

摘要

我们提出了一种名为"选项流"(Flow-of-Options,FoO)的新型推理方法,旨在解决大型语言模型(LLM)中存在的固有偏差。该方法使LLM能够在推理过程中系统地探索多种可能性,这一点通过我们开发的基于FoO的自主求解机器学习(ML)任务的智能框架得到验证。FoO通过压缩且可解释的任务表征来确保LLM解决方案的多样性,相较于最先进的基线方法,在标准数据科学任务上实现了38.2%-69.2%的性能提升,在治疗化学任务上实现了37.4%-47.9%的提升。我们的框架单任务运行成本低于1美元,非常适合成本敏感型应用。除表格分类和回归任务外,我们还展示了基于FoO的智能系统在强化学习和图像生成等任务中的更广泛适用性。相关代码已开源:https://github.com/flagshippioneering/Flow-of-Options。


BaxBench: Can LLMs Generate Correct and Secure Backends?

Abstract

arXiv:2502.11844v3 Announce Type: replace-cross Abstract: Automatic program generation has long been a fundamental challenge in computer science. Recent benchmarks have shown that large language models (LLMs) can effectively generate code at the function level, make code edits, and solve algorithmic coding tasks. However, to achieve full automation, LLMs should be able to generate production-quality, self-contained application modules. To evaluate the capabilities of LLMs in solving this challenge, we introduce BaxBench, a novel evaluation benchmark consisting of 392 tasks for the generation of backend applications. We focus on backends for three critical reasons: (i) they are practically relevant, building the core components of most modern web and cloud software, (ii) they are difficult to get right, requiring multiple functions and files to achieve the desired functionality, and (iii) they are security-critical, as they are exposed to untrusted third-parties, making secure solutions that prevent deployment-time attacks an imperative. BaxBench validates the functionality of the generated applications with comprehensive test cases, and assesses their security exposure by executing end-to-end exploits. Our experiments reveal key limitations of current LLMs in both functionality and security: (i) even the best model, OpenAI o1, achieves a mere 62% on code correctness; (ii) on average, we could successfully execute security exploits on around half of the correct programs generated by each LLM; and (iii) in less popular backend frameworks, models further struggle to generate correct and secure applications. Progress on BaxBench signifies important steps towards autonomous and secure software development with LLMs.

摘要

自动程序生成长期以来一直是计算机科学领域的基础性挑战。最新基准测试表明,大型语言模型(LLMs)能够有效生成函数级代码、进行代码编辑并解决算法编码任务。然而要实现完全自动化,LLMs需要具备生成生产级、自包含应用模块的能力。为评估LLMs应对这一挑战的能力,我们提出了BaxBench——一个包含392个后端应用生成任务的新型评估基准。我们聚焦后端应用基于三个关键原因:(i)其具有实际相关性,构成大多数现代网络和云软件的核心组件;(ii)实现难度高,需要协调多个函数和文件才能达成预期功能;(iii)安全性至关重要,由于需面向不可信的第三方开放,必须确保解决方案能防范部署阶段的攻击。BaxBench通过全面测试用例验证生成应用的功能性,并通过端到端漏洞攻击评估其安全风险。实验揭示了当前LLMs在功能性和安全性方面的关键局限:(i)性能最佳的OpenAI o1模型在代码正确性上仅达62%;(ii)平均而言,我们能在各LLM生成的正确程序中成功执行约半数安全攻击;(iii)在冷门后端框架中,模型更难生成正确且安全的应用程序。BaxBench的进展标志着LLMs在实现自主安全软件开发道路上迈出了重要步伐。


TETRIS: Optimal Draft Token Selection for Batch Speculative Decoding

Abstract

arXiv:2502.15197v2 Announce Type: replace-cross Abstract: We propose TETRIS, a novel method that optimizes the total throughput of batch speculative decoding in multi-request settings. Unlike existing methods that optimize for a single request or a group of requests as a whole, TETRIS actively selects the most promising draft tokens (for every request in a batch) to be accepted when verified in parallel, resulting in fewer rejected tokens and hence less wasted computing resources. Such an effective resource utilization to achieve fast inference in large language models (LLMs) is especially important to service providers with limited inference capacity. Compared to baseline speculative decoding, TETRIS yields a consistently higher acceptance rate and more effective utilization of the limited inference capacity. We show theoretically and empirically that TETRIS outperforms baseline speculative decoding and existing methods that dynamically select draft tokens, leading to a more efficient batch inference in LLMs.

摘要

我们提出了一种名为TETRIS的新方法,用于优化多请求场景下批量推测解码的总吞吐量。与现有方法(仅针对单个请求或整个请求组进行优化)不同,TETRIS主动为批次中的每个请求选择在并行验证时最有可能被接受的候选标记,从而减少被拒绝的标记数量,降低计算资源浪费。这种实现大型语言模型(LLM)快速推理的有效资源利用方式,对于推理能力有限的服务提供商尤为重要。相较于基准推测解码方法,TETRIS能持续产生更高的接受率,并更有效地利用有限推理能力。我们通过理论分析和实验验证表明,TETRIS优于基准推测解码以及现有的动态选择候选标记的方法,可实现LLM更高效的批量推理。
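Selecting draft tokens across a batch under a shared verification budget can be sketched as a greedy procedure. The sketch below makes two assumptions not stated in the abstract: the draft model's token probability is used as a proxy for the acceptance probability, and a draft token is only useful if every earlier draft token in the same request is accepted, so each token is scored by the product of probabilities along its prefix. The function name `select_draft_tokens` and the `capacity` parameter are hypothetical.

```python
import heapq
from typing import List


def select_draft_tokens(draft_probs: List[List[float]], capacity: int) -> List[int]:
    """Return, per request, how many leading draft tokens to send for verification.

    draft_probs[i][k] is the assumed acceptance probability of the k-th draft
    token of request i; capacity is the total number of draft tokens the
    target model can verify in one batch.
    """
    chosen = [0] * len(draft_probs)
    # Max-heap via negated scores. A token's score is the probability that
    # its whole prefix is accepted.
    heap = [(-probs[0], i) for i, probs in enumerate(draft_probs) if probs]
    heapq.heapify(heap)
    while heap and sum(chosen) < capacity:
        neg_score, i = heapq.heappop(heap)
        chosen[i] += 1
        k = chosen[i]
        if k < len(draft_probs[i]):
            # Extend the prefix: the cumulative probability can only shrink,
            # so greedy selection always picks a valid prefix per request.
            heapq.heappush(heap, (neg_score * draft_probs[i][k], i))
    return chosen


batch = [[0.9, 0.8, 0.1], [0.5, 0.4], [0.95]]
allocation = select_draft_tokens(batch, capacity=4)
```

With a budget of four tokens, the toy batch gives two slots to the first request and one each to the other two, skipping the low-probability third token of the first request.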


Middle-Layer Representation Alignment for Cross-Lingual Transfer in Fine-Tuned LLMs

Abstract

arXiv:2502.14830v2 Announce Type: replace-cross Abstract: While large language models demonstrate remarkable capabilities at task-specific applications through fine-tuning, extending these benefits across diverse languages is essential for broad accessibility. However, effective cross-lingual transfer is hindered by LLM performance gaps across languages and the scarcity of fine-tuning data in many languages. Through analysis of LLM internal representations from over 1,000 language pairs, we discover that middle layers exhibit the strongest potential for cross-lingual alignment. Building on this finding, we propose a middle-layer alignment objective integrated into task-specific training. Our experiments on slot filling, machine translation, and structured text generation show consistent improvements in cross-lingual transfer, especially to lower-resource languages. The method is robust to the choice of alignment languages and generalizes to languages unseen during alignment. Furthermore, we show that separately trained alignment modules can be merged with existing task-specific modules, improving cross-lingual capabilities without full re-training. Our code is publicly available (https://github.com/dannigt/mid-align).

摘要

虽然大型语言模型通过微调在特定任务应用中展现出卓越性能,但将这些优势扩展到多种语言对于实现广泛可及性至关重要。然而,跨语言迁移的有效性受到两大制约:不同语言间LLM性能差异显著,以及多数语言微调数据匮乏。通过对1000多种语言对的LLM内部表征进行分析,我们发现中间层具有最强的跨语言对齐潜力。基于这一发现,我们提出了一种集成于任务特定训练中的中间层对齐目标。在槽填充、机器翻译和结构化文本生成任务上的实验表明,该方法能持续提升跨语言迁移效果,尤其对低资源语言效果显著。该方法对对齐语言的选择具有鲁棒性,并能泛化至对齐阶段未见的语言。此外,我们还证明单独训练的对齐模块可与现有任务专用模块合并,无需完整重新训练即可增强跨语言能力。代码已开源(https://github.com/dannigt/mid-align)。


Recurrent Knowledge Identification and Fusion for Language Model Continual Learning

Abstract

arXiv:2502.17510v2 Announce Type: replace-cross Abstract: Continual learning (CL) is crucial for deploying large language models (LLMs) in dynamic real-world environments without costly retraining. While recent model ensemble and model merging methods guided by parameter importance have gained popularity, they often struggle to balance knowledge transfer and forgetting, mainly due to the reliance on static importance estimates during sequential training. In this paper, we present Recurrent-KIF, a novel CL framework for Recurrent Knowledge Identification and Fusion, which enables dynamic estimation of parameter importance distributions to enhance knowledge transfer. Inspired by human continual learning, Recurrent-KIF employs an inner loop that rapidly adapts to new tasks while identifying important parameters, coupled with an outer loop that globally manages the fusion of new and historical knowledge through redundant knowledge pruning and key knowledge merging. These inner-outer loops iteratively perform multiple rounds of fusion, allowing Recurrent-KIF to leverage intermediate training information and adaptively adjust fusion strategies based on evolving importance distributions. Extensive experiments on two CL benchmarks with various model sizes (from 770M to 13B) demonstrate that Recurrent-KIF effectively mitigates catastrophic forgetting and enhances knowledge transfer.

摘要

持续学习(CL)对于在动态现实环境中部署大型语言模型(LLM)而无需昂贵重新训练至关重要。尽管当前基于参数重要性的模型集成与模型融合方法日益流行,但这些方法往往难以平衡知识迁移与遗忘问题,主要归因于序列训练过程中对静态重要性评估的依赖。本文提出Recurrent-KIF——一种面向循环知识识别与融合的新型CL框架,通过动态估计参数重要性分布以增强知识迁移能力。受人类持续学习机制启发,Recurrent-KIF采用内循环快速适应新任务并识别重要参数,外循环则通过冗余知识剪枝与关键知识合并全局管理新旧知识融合。这种内外循环通过多轮迭代融合,使模型能够利用中间训练信息,并基于动态演化的重要性分布自适应调整融合策略。在两种CL基准测试(模型规模770M至13B)上的大量实验表明,Recurrent-KIF能有效缓解灾难性遗忘并提升知识迁移效率。


DistiLLM-2: A Contrastive Approach Boosts the Distillation of LLMs

Abstract

arXiv:2503.07067v2 Announce Type: replace-cross Abstract: Despite the success of distillation in large language models (LLMs), most prior work applies identical loss functions to both teacher- and student-generated data. These strategies overlook the synergy between loss formulations and data types, leading to a suboptimal performance boost in student models. To address this, we propose DistiLLM-2, a contrastive approach that simultaneously increases the likelihood of teacher responses and decreases that of student responses by harnessing this synergy. Our extensive experiments show that DistiLLM-2 not only builds high-performing student models across a wide range of tasks, including instruction-following and code generation, but also supports diverse applications, such as preference alignment and vision-language extensions. These findings highlight the potential of a contrastive approach to enhance the efficacy of LLM distillation by effectively aligning teacher and student models across varied data types.

摘要

尽管蒸馏方法在大型语言模型(LLMs)中取得了成功,但先前研究大多对教师模型和学生模型生成的数据采用相同的损失函数。这些策略忽视了损失函数与数据类型之间的协同作用,导致学生模型的性能提升受限。为此,我们提出DistiLLM-2——一种通过利用这种协同效应,同时提高教师响应概率并降低学生响应概率的对比方法。大量实验表明,DistiLLM-2不仅能构建在指令跟随、代码生成等广泛任务中表现优异的学生模型,还支持偏好对齐和视觉语言扩展等多样化应用。这些发现凸显了对比方法通过有效协调不同数据类型下的教师模型与学生模型,从而提升LLM蒸馏效能的潜力。


Beyond Prompting: An Efficient Embedding Framework for Open-Domain Question Answering

Abstract

arXiv:2503.01606v2 Announce Type: replace-cross Abstract: Large language models have recently pushed open domain question answering (ODQA) to new frontiers. However, prevailing retriever-reader pipelines often depend on multiple rounds of prompt level instructions, leading to high computational overhead, instability, and suboptimal retrieval coverage. In this paper, we propose EmbQA, an embedding-level framework that alleviates these shortcomings by enhancing both the retriever and the reader. Specifically, we refine query representations via lightweight linear layers under an unsupervised contrastive learning objective, thereby reordering retrieved passages to highlight those most likely to contain correct answers. Additionally, we introduce an exploratory embedding that broadens the model's latent semantic space to diversify candidate generation and employs an entropy-based selection mechanism to choose the most confident answer automatically. Extensive experiments across three open-source LLMs, three retrieval methods, and four ODQA benchmarks demonstrate that EmbQA substantially outperforms recent baselines in both accuracy and efficiency.

摘要

大型语言模型近期将开放域问答(ODQA)推向了新高度。然而,当前主流的检索-阅读器流水线通常依赖多轮提示级指令,导致计算开销高、稳定性差且检索覆盖率欠佳。本文提出EmbQA框架,通过增强检索器和阅读器来缓解上述缺陷。具体而言,我们在无监督对比学习目标下,通过轻量级线性层优化查询表示,从而对检索到的段落重新排序,突出最可能包含正确答案的文本。此外,我们引入探索性嵌入以扩展模型的潜在语义空间,从而增加候选答案的多样性,并采用基于熵的选择机制自动筛选置信度最高的答案。在三种开源大语言模型、三种检索方法和四个ODQA基准测试上的大量实验表明,EmbQA在准确性和效率方面均显著优于现有基线方法。
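The entropy-based selection step can be illustrated with a small sketch: among several candidate answers, keep the one whose generation was least uncertain. The abstract does not say how the entropy is computed, so averaging per-token Shannon entropy is an assumption here, and `pick_most_confident` is a hypothetical helper name.

```python
import math
from typing import Dict, List


def token_entropy(dist: List[float]) -> float:
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)


def mean_entropy(dists: List[List[float]]) -> float:
    """Average entropy over the tokens of one candidate answer."""
    return sum(token_entropy(d) for d in dists) / len(dists)


def pick_most_confident(candidates: Dict[str, List[List[float]]]) -> str:
    """Return the candidate answer with the lowest mean token entropy.

    `candidates` maps answer text to the per-token distributions the model
    produced while generating that answer.
    """
    return min(candidates, key=lambda answer: mean_entropy(candidates[answer]))


toy_candidates = {
    "Paris": [[0.97, 0.02, 0.01]],  # sharply peaked: low entropy
    "Lyon": [[0.4, 0.3, 0.3]],      # nearly flat: high entropy
}
best_answer = pick_most_confident(toy_candidates)
```

Lower entropy is used as a stand-in for higher confidence. It does not guarantee correctness, which is why the framework pairs it with retrieval reordering.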


SwiLTra-Bench: The Swiss Legal Translation Benchmark

Abstract

arXiv:2503.01372v2 Announce Type: replace-cross Abstract: In Switzerland legal translation is uniquely important due to the country's four official languages and requirements for multilingual legal documentation. However, this process traditionally relies on professionals who must be both legal experts and skilled translators -- creating bottlenecks and impacting effective access to justice. To address this challenge, we introduce SwiLTra-Bench, a comprehensive multilingual benchmark of over 180K aligned Swiss legal translation pairs comprising laws, headnotes, and press releases across all Swiss languages along with English, designed to evaluate LLM-based translation systems. Our systematic evaluation reveals that frontier models achieve superior translation performance across all document types, while specialized translation systems excel specifically in laws but under-perform in headnotes. Through rigorous testing and human expert validation, we demonstrate that while fine-tuning open SLMs significantly improves their translation quality, they still lag behind the best zero-shot prompted frontier models such as Claude-3.5-Sonnet. Additionally, we present SwiLTra-Judge, a specialized LLM evaluation system that aligns best with human expert assessments.

摘要

在瑞士,由于该国拥有四种官方语言且法律文件须以多语种提供,法律翻译具有独特的重要性。然而,这一过程传统上依赖于既是法律专家又是熟练译者的专业人员,由此形成瓶颈并影响有效的司法可及性。为解决该问题,我们推出SwiLTra-Bench——一个包含逾18万组对齐文本的综合性多语种基准数据集,涵盖瑞士所有官方语言及英语的法律条文、判决摘要和新闻公告,专为评估基于大语言模型的翻译系统而设计。系统评估表明:前沿模型在所有文本类型中均表现优异,而专业翻译系统虽在法律条文上表现突出,却在判决摘要中逊色。通过严格测试与专家人工验证,我们发现尽管对开源小模型进行微调能显著提升其翻译质量,但仍落后于Claude-3.5-Sonnet等最佳零样本提示前沿模型。此外,我们开发了SwiLTra-Judge专业评估系统,其判断结果与人类专家评估具有最高一致性。


Image Captioning Evaluation in the Age of Multimodal LLMs: Challenges and Future Perspectives

Abstract

arXiv:2503.14604v2 Announce Type: replace-cross Abstract: The evaluation of machine-generated image captions is a complex and evolving challenge. With the advent of Multimodal Large Language Models (MLLMs), image captioning has become a core task, increasing the need for robust and reliable evaluation metrics. This survey provides a comprehensive overview of advancements in image captioning evaluation, analyzing the evolution, strengths, and limitations of existing metrics. We assess these metrics across multiple dimensions, including correlation with human judgment, ranking accuracy, and sensitivity to hallucinations. Additionally, we explore the challenges posed by the longer and more detailed captions generated by MLLMs and examine the adaptability of current metrics to these stylistic variations. Our analysis highlights some limitations of standard evaluation approaches and suggests promising directions for future research in image captioning assessment.

摘要

机器生成图像描述的评估是一个复杂且不断发展的挑战。随着多模态大语言模型(MLLMs)的出现,图像描述生成已成为核心任务,这增强了对鲁棒且可靠评估指标的需求。本综述全面概述了图像描述评估的进展,分析了现有指标的演变、优势与局限性。我们从多维度评估这些指标,包括与人类判断的相关性、排序准确性以及对幻觉的敏感性。此外,我们探讨了MLLMs生成的更长、更详细描述所带来的挑战,并检验了当前指标对这些风格变化的适应性。我们的分析揭示了标准评估方法的一些局限性,并为图像描述评估的未来研究方向提出了潜在路径。


WaferLLM: Large Language Model Inference at Wafer Scale

Abstract

arXiv:2502.04563v3 Announce Type: replace-cross Abstract: Emerging AI accelerators increasingly adopt wafer-scale manufacturing technologies, integrating hundreds of thousands of AI cores in a mesh architecture with large distributed on-chip memory (tens of GB in total) and ultra-high on-chip memory bandwidth (tens of PB/s). However, current LLM inference systems, optimized for shared memory architectures like GPUs, fail to exploit these accelerators fully. We introduce WaferLLM, the first wafer-scale LLM inference system. WaferLLM is guided by a novel PLMR model (pronounced as "Plummer") that captures the unique hardware characteristics of wafer-scale architectures. Leveraging this model, WaferLLM pioneers wafer-scale LLM parallelism, optimizing the utilization of hundreds of thousands of on-chip cores. It also introduces MeshGEMM and MeshGEMV, the first GEMM and GEMV implementations designed to scale effectively on wafer-scale accelerators. Evaluations show that WaferLLM achieves up to 200× higher accelerator utilization than state-of-the-art methods. Leveraging a wafer-scale accelerator (Cerebras WSE2), WaferLLM delivers GEMV operations 606× faster and 16× more energy-efficient than on an NVIDIA A100 GPU. For full LLM inference, WaferLLM achieves 10-20× speedups over A100 GPU clusters running SGLang and vLLM. These advantages are expected to grow as wafer-scale AI models, software, and hardware continue to mature. WaferLLM is open-sourced at https://github.com/MeshInfra/WaferLLM.

摘要

新兴AI加速器日益采用晶圆级制造技术,通过网状架构集成数十万个AI核心,配备大型分布式片上内存(总量达数十GB)和超高片上内存带宽(总量达数十PB/s)。然而,当前针对GPU等共享内存架构优化的LLM推理系统无法充分利用这些加速器。 我们提出WaferLLM——首个晶圆级LLM推理系统。该系统以创新的PLMR模型(发音同"Plummer")为指导,该模型能精准捕捉晶圆级架构的独特硬件特性。基于此模型,WaferLLM开创了晶圆级LLM并行技术,优化数十万片上核心的利用率,并首次实现专为晶圆级加速器设计的MeshGEMM和MeshGEMV运算。 评估表明,WaferLLM的加速器利用率较现有最优方法提升高达200倍。在晶圆级加速器(Cerebras WSE2)上,其GEMV运算速度较NVIDIA A100 GPU快606倍,能效高16倍。对于完整LLM推理,WaferLLM比运行SGLang和vLLM的A100 GPU集群快10-20倍。随着晶圆级AI模型、软件和硬件的持续成熟,这些优势预计将进一步扩大。WaferLLM已开源:https://github.com/MeshInfra/WaferLLM。


HelpSteer3: Human-Annotated Feedback and Edit Data to Empower Inference-Time Scaling in Open-Ended General-Domain Tasks

Abstract

arXiv:2503.04378v2 Announce Type: replace-cross Abstract: Inference-Time Scaling has been critical to the success of recent models such as OpenAI o1 and DeepSeek R1. However, many techniques used to train models for inference-time scaling require tasks to have answers that can be verified, limiting their application to domains such as math, coding and logical reasoning. We take inspiration from how humans make first attempts, ask for detailed feedback from others and make improvements based on such feedback across a wide spectrum of open-ended endeavors. To this end, we collect HelpSteer3 data to train dedicated Feedback and Edit Models that are capable of performing inference-time scaling for open-ended general-domain tasks. In our setup, one model generates an initial response, a second model gives feedback on it, and a third model uses that feedback to edit the response. We show that performance on Arena Hard, a benchmark strongly predictive of Chatbot Arena Elo, can be boosted by scaling the number of initial response drafts, effective feedback and edited responses. When scaled optimally, our setup based on 70B models from the Llama 3 family can reach SoTA performance on Arena Hard at 92.7 as of 5 Mar 2025, surpassing OpenAI o1-preview-2024-09-12 with 90.4 and DeepSeek R1 with 92.3.

摘要

推理时缩放技术已成为近期模型(如OpenAI o1和DeepSeek R1)成功的关键因素。然而,许多用于训练支持推理时缩放的模型技术要求任务答案具备可验证性,使其应用局限于数学、编程和逻辑推理等领域。我们从人类在开放式任务中的行为模式获得启发:先进行初步尝试,向他人获取详细反馈,并根据反馈进行改进。为此,我们收集HelpSteer3数据集,训练专用的反馈模型和编辑模型,使其能够对开放式通用领域任务实施推理时缩放。在我们的框架中,首个模型生成初始响应,第二个模型提供反馈,第三个模型基于反馈修改响应。实验表明,通过增加初始响应草稿、有效反馈和编辑后响应的数量,可提升在Arena Hard基准(该基准对Chatbot Arena Elo评分具有强预测性)上的表现。当采用最优缩放策略时,基于Llama 3系列700亿参数模型的系统在2025年3月5日达到Arena Hard 92.7分的当前最先进水平,超越OpenAI o1-preview-2024-09-12(90.4分)和DeepSeek R1(92.3分)。
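The three-model setup described above (draft, feedback, edit) can be sketched as a small pipeline whose cost scales with the number of drafts and the number of feedback rounds per draft. The callables stand in for the three models, and the final `select` step is a hypothetical addition, since the abstract does not say how one response is chosen among the edited candidates.

```python
from typing import Callable, List


def feedback_edit_pipeline(
    prompt: str,
    generate: Callable[[str], str],
    give_feedback: Callable[[str, str], str],
    edit: Callable[[str, str, str], str],
    select: Callable[[str, List[str]], str],
    num_drafts: int,
    num_feedbacks: int,
) -> str:
    """Produce num_drafts * num_feedbacks edited candidates and pick one."""
    candidates: List[str] = []
    for _ in range(num_drafts):
        draft = generate(prompt)  # first model: initial response
        for _ in range(num_feedbacks):
            feedback = give_feedback(prompt, draft)  # second model: critique
            candidates.append(edit(prompt, draft, feedback))  # third model: revision
    return select(prompt, candidates)


# Deterministic toy stand-ins so the sketch runs without any model.
_draft_counter = {"n": 0}


def toy_generate(prompt: str) -> str:
    _draft_counter["n"] += 1
    return f"draft {_draft_counter['n']} for {prompt!r}"


def toy_feedback(prompt: str, draft: str) -> str:
    return "be more specific"


def toy_edit(prompt: str, draft: str, feedback: str) -> str:
    return f"{draft} [edited: {feedback}]"


seen_candidates: List[str] = []


def toy_select(prompt: str, candidates: List[str]) -> str:
    seen_candidates.extend(candidates)
    return candidates[-1]  # hypothetical rule: keep the last candidate


final_response = feedback_edit_pipeline(
    "explain KV caching", toy_generate, toy_feedback, toy_edit, toy_select,
    num_drafts=2, num_feedbacks=3,
)
```

Increasing `num_drafts` or `num_feedbacks` grows the candidate pool multiplicatively. That pool size is the inference-time compute knob this sketch exposes.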


FactSelfCheck: Fact-Level Black-Box Hallucination Detection for LLMs

Abstract

arXiv:2503.17229v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) frequently generate hallucinated content, posing significant challenges for applications where factuality is crucial. While existing hallucination detection methods typically operate at the sentence level or passage level, we propose FactSelfCheck, a novel black-box sampling-based method that enables fine-grained fact-level detection. Our approach represents text as knowledge graphs consisting of facts in the form of triples. Through analyzing factual consistency across multiple LLM responses, we compute fine-grained hallucination scores without requiring external resources or training data. Our evaluation demonstrates that FactSelfCheck performs competitively with leading sentence-level sampling-based methods while providing more detailed insights. Most notably, our fact-level approach significantly improves hallucination correction, achieving a 35.5% increase in factual content compared to the baseline, while sentence-level SelfCheckGPT yields only a 10.6% improvement. The granular nature of our detection enables more precise identification and correction of hallucinated content. Additionally, we contribute a new dataset for evaluating sampling-based methods - FavaMultiSamples.

摘要

大型语言模型(LLMs)经常生成虚假内容,这对事实性要求严格的应用场景构成重大挑战。现有幻觉检测方法通常局限于句子或段落层面,本文提出FactSelfCheck——一种基于采样的新型黑盒检测方法,可实现细粒度的事实级检测。我们的方法将文本表示为由三元组事实构成的知识图谱,通过分析多个LLM响应间的事实一致性,无需外部资源或训练数据即可计算细粒度幻觉分数。评估表明,FactSelfCheck与领先的句子级采样方法性能相当,同时能提供更精细的分析。尤为突出的是,我们的事实级方法显著提升了幻觉修正效果,相比基线模型使事实内容增加35.5%,而句子级SelfCheckGPT仅提升10.6%。这种细粒度检测机制能更精准地识别和修正虚假内容。此外,我们还贡献了一个用于评估采样方法的新数据集FavaMultiSamples。
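Fact-level consistency scoring can be sketched with triples and set membership. This simplified version assumes facts are compared by exact match. The actual method may judge support for each fact differently, so the helper `fact_hallucination_scores` is a rough illustration only.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def fact_hallucination_scores(
    response_facts: Set[Triple],
    sample_facts: List[Set[Triple]],
) -> Dict[Triple, float]:
    """Score each fact by the share of sampled responses that do NOT contain it.

    0.0 means every sample supports the fact; 1.0 means none does.
    """
    if not sample_facts:
        raise ValueError("at least one sampled response is required")
    return {
        fact: sum(fact not in sample for sample in sample_facts) / len(sample_facts)
        for fact in response_facts
    }


fact_a: Triple = ("Marie Curie", "born_in", "Warsaw")
fact_b: Triple = ("Marie Curie", "won", "Fields Medal")
main_response = {fact_a, fact_b}
samples = [
    {fact_a},
    {fact_a, fact_b},
    {fact_a},
    {fact_a, ("Marie Curie", "won", "Nobel Prize in Physics")},
]
scores = fact_hallucination_scores(main_response, samples)
```

Because each fact gets its own score, a correction step can target only the weakly supported triple instead of rewriting a whole sentence.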


LLM Benchmarking with LLaMA2: Evaluating Code Development Performance Across Multiple Programming Languages

Abstract

arXiv:2503.19217v2 Announce Type: replace-cross Abstract: The rapid evolution of large language models (LLMs) has opened new possibilities for automating various tasks in software development. This paper evaluates the capabilities of the Llama 2-70B model in automating these tasks for scientific applications written in commonly used programming languages. Using representative test problems, we assess the model's capacity to generate code, documentation, and unit tests, as well as its ability to translate existing code between commonly used programming languages. Our comprehensive analysis evaluates the compilation, runtime behavior, and correctness of the generated and translated code. Additionally, we assess the quality of automatically generated code, documentation and unit tests. Our results indicate that while Llama 2-70B frequently generates syntactically correct and functional code for simpler numerical tasks, it encounters substantial difficulties with more complex, parallelized, or distributed computations, requiring considerable manual corrections. We identify key limitations and suggest areas for future improvements to better leverage AI-driven automation in scientific computing workflows.

摘要

大型语言模型(LLMs)的快速发展为软件开发中各类任务的自动化提供了新的可能性。本文评估了Llama 2-70B模型在针对常用编程语言编写的科学应用实现任务自动化方面的能力。通过代表性测试问题,我们考察了该模型生成代码、文档和单元测试的能力,以及在不同常用编程语言间转换现有代码的表现。我们的综合分析评估了生成代码和转换代码的编译情况、运行时行为及正确性,同时对自动生成的代码、文档和单元测试质量进行了评价。研究结果表明,虽然Llama 2-70B在简单数值计算任务中经常能生成语法正确且功能正常的代码,但在处理更复杂的并行化或分布式计算时存在显著困难,需要大量人工修正。我们指出了关键局限性,并为未来改进提出了建议方向,以更好地利用人工智能驱动自动化技术优化科学计算工作流程。


LEMMA: Learning from Errors for MatheMatical Advancement in LLMs

Abstract

arXiv:2503.17439v2 Announce Type: replace-cross Abstract: Large language models (LLMs) have demonstrated remarkable reasoning capability in solving mathematical problems. However, existing approaches primarily focus on improving the quality of correct training data, e.g., distilling high-quality correct solutions from advanced models, neglecting the value contained in error data, potentially hindering the model's reflective ability. Though some studies attempt to leverage error data, they often involve complex mechanisms, such as Monte Carlo Tree Search (MCTS) to explore error nodes. In this work, we propose to enhance LLMs' reasoning ability by Learning from Errors for Mathematical Advancement (LEMMA). LEMMA constructs data consisting of an incorrect solution with an erroneous step and a reflection connection to a correct solution for fine-tuning. Specifically, we systematically analyze the model-generated error types and introduce an error-type grounded mistake augmentation method to collect diverse and representative errors. Correct solutions are either from fixing the errors or generating a fresh start. Through a model-aware smooth reflection connection, the erroneous solution is transferred to the correct one. By fine-tuning on the constructed dataset, the model is able to self-correct errors autonomously within the generation process without relying on external critique models. Experimental results demonstrate that LEMMA achieves significant performance improvements over other strong baselines.

摘要

大型语言模型(LLMs)在解决数学问题方面展现出卓越的推理能力。然而,现有方法主要集中于提升正确训练数据的质量,例如通过从先进模型中提炼高质量正确解法,却忽视了错误数据所蕴含的价值,这可能限制模型的反思能力。尽管部分研究尝试利用错误数据,但其机制往往较为复杂,例如采用蒙特卡洛树搜索(MCTS)来探索错误节点。本研究提出通过"从错误中学习以实现数学进阶"(LEMMA)来增强LLMs的推理能力。LEMMA构建了包含错误步骤的错误解法与通过反思连接至正确解法的微调数据集。具体而言,我们系统分析了模型生成的错误类型,并提出基于错误类型的错误增强方法以收集多样且具代表性的错误。正确解法或通过修正错误获得,或重新生成全新解法。借助模型感知的平滑反思连接,错误解法被转化为正确解法。通过在构建的数据集上进行微调,模型能够在生成过程中自主纠正错误,而无需依赖外部评判模型。实验结果表明,LEMMA相较于其他强基线模型实现了显著的性能提升。


NdLinear: Don't Flatten! Building Superior Neural Architectures by Preserving N-D Structure

Abstract

arXiv:2503.17353v2 Announce Type: replace-cross Abstract: Many high-impact machine learning tasks involve multi-dimensional data such as images, volumetric medical scans, and multivariate time-series. Yet, most neural architectures flatten these inputs, discarding critical cross-dimension information. We introduce NdLinear, a novel linear transformation that circumvents this destructive flattening by operating directly on tensors. NdLinear applies transformations separately along each data dimension, thereby preserving the native data structure. Extensive experiments demonstrate NdLinear's capacity to significantly enhance representational power, achieve dramatic parameter reductions (often by orders of magnitude), and maintain a favorable computational profile. For instance, when applied to Large Language Model finetuning, our NdLinear-LoRA delivers comparable or improved accuracy on reasoning tasks using up to 9× fewer trainable parameters than standard LoRA. These broad advantages of NdLinear are consistently validated across diverse neural architectures (CNNs, RNNs, Transformers, MLPs) and data domains, including vision, language, time-series, and tabular tasks. As a versatile, drop-in replacement for standard linear layers, NdLinear processes data in its original N-dimensional form, offering a foundational component for developing more efficient and powerful next-generation neural architectures.

摘要

许多高影响力的机器学习任务涉及多维数据,如图像、体积医学扫描和多变量时间序列。然而,大多数神经网络架构会将这些输入展平,丢弃关键的跨维度信息。我们提出了一种新颖的线性变换方法NdLinear,该方法通过直接对张量进行操作来规避这种破坏性展平。NdLinear沿每个数据维度分别应用变换,从而保留原始数据结构。大量实验证明,NdLinear能够显著增强表示能力,实现参数量的急剧减少(通常达数量级),并保持优越的计算效率。例如,在大型语言模型微调中,我们的NdLinear-LoRA在推理任务上实现了与标准LoRA相当或更优的精度,同时使用的可训练参数最多减少9×。NdLinear的这些广泛优势在多种神经网络架构(CNN、RNN、Transformer、MLP)和数据领域(包括视觉、语言、时间序列和表格任务)中得到了持续验证。作为一种可即插即用的标准线性层替代方案,NdLinear以原始N维形式处理数据,为开发更高效、更强大的下一代神经网络架构提供了基础组件。
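
The abstract's central idea, applying a separate transformation along each data dimension instead of flattening, can be illustrated with a small sketch. This is not the authors' implementation: the function names `per_axis_linear_2d` and `param_counts` are hypothetical, bias terms are omitted, and only the 2-D case is shown, where the per-axis maps reduce to `W_rows · X · W_colsᵀ`.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def per_axis_linear_2d(x, w_rows, w_cols):
    """Map a (d1 x d2) input to (h1 x h2) by transforming each axis separately.

    w_rows has shape (h1 x d1) and acts on axis 0;
    w_cols has shape (h2 x d2) and acts on axis 1.
    """
    return matmul(matmul(w_rows, x), transpose(w_cols))


def param_counts(in_shape, out_shape):
    """Weight counts (bias omitted): flattened dense layer vs. one matrix per axis."""
    flat_in = flat_out = 1
    per_axis = 0
    for d, h in zip(in_shape, out_shape):
        flat_in *= d
        flat_out *= h
        per_axis += d * h
    return flat_in * flat_out, per_axis


x = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]          # d1 = 2, d2 = 3
w_rows = [[1.0, 1.0]]           # h1 = 1: sums the two rows
w_cols = [[1.0, 0.0, 0.0],
          [0.0, 0.0, 1.0]]      # h2 = 2: keeps columns 0 and 2
y = per_axis_linear_2d(x, w_rows, w_cols)
print(y)                                       # [[5.0, 9.0]]
print(param_counts((32, 32, 3), (32, 32, 3)))  # (9437184, 2057)
```

For a 32×32×3 input mapped to the same shape, the flattened layer in this sketch needs 3072 × 3072 weights, while the per-axis version needs 32·32 + 32·32 + 3·3, which is the kind of parameter reduction the abstract refers to.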


TuRTLe: A Unified Evaluation of LLMs for RTL Generation

Abstract

arXiv:2504.01986v2 Announce Type: replace-cross Abstract: The rapid advancements in LLMs have driven the adoption of generative AI in various domains, including Electronic Design Automation (EDA). Unlike traditional software development, EDA presents unique challenges, as generated RTL code must not only be syntactically correct and functionally accurate but also synthesizable by hardware generators while meeting performance, power, and area constraints. These additional requirements introduce complexities that existing code-generation benchmarks often fail to capture, limiting their effectiveness in evaluating LLMs for RTL generation. To address this gap, we propose TuRTLe, a unified evaluation framework designed to systematically assess LLMs across key RTL generation tasks. TuRTLe integrates multiple existing benchmarks and automates the evaluation process, enabling a comprehensive assessment of LLM performance in syntax correctness, functional correctness, synthesis, PPA optimization, and exact line completion. Using this framework, we benchmark a diverse set of open LLMs and analyze their strengths and weaknesses in EDA-specific tasks. Our results show that reasoning-based models, such as DeepSeek R1, consistently outperform others across multiple evaluation criteria, but at the cost of increased computational overhead and inference latency. Additionally, base models are better suited to module completion tasks, while instruct-tuned models perform better in specification-to-RTL tasks.

摘要

大型语言模型(LLM)的快速发展推动了生成式人工智能在电子设计自动化(EDA)等领域的应用。与传统软件开发不同,EDA领域存在独特挑战:生成的寄存器传输级(RTL)代码不仅需要语法正确、功能精确,还必须能被硬件生成器综合,同时满足性能、功耗和面积(PPA)约束。这些附加要求带来了现有代码生成基准测试难以捕捉的复杂性,限制了其在评估LLM生成RTL能力方面的有效性。为此,我们提出TuRTLe评估框架,该系统化评估框架专为关键RTL生成任务设计。TuRTLe整合了多个现有基准测试并实现评估流程自动化,可全面评估LLM在语法正确性、功能正确性、可综合性、PPA优化及精确行补全等方面的表现。通过该框架,我们对多种开源LLM进行基准测试,并分析其在EDA特定任务中的优劣势。实验结果表明,基于推理的模型(如DeepSeek R1)在多项评估标准中持续领先,但需付出更高计算开销和推理延迟的代价。此外,基础模型更适用于模块补全任务,而指令调优模型在规范到RTL的转换任务中表现更优。


Inverse Reinforcement Learning with Dynamic Reward Scaling for LLM Alignment

Abstract

arXiv:2503.18991v3 Announce Type: replace-cross Abstract: Robust alignment is vital for safely deploying large language models (LLMs). Existing techniques are either reward-based -- training a reward model on preference pairs and optimizing with reinforcement learning (RL) -- or reward-free -- directly fine-tuning on ranked outputs. Recent research shows that well-tuned reward-based pipelines remain the most robust, and single-response demonstrations can outperform pairwise preference data. However, two key challenges remain: (i) imbalanced safety datasets that over-represent common hazards while neglecting long-tail threats; and (ii) static reward models that ignore task difficulty, limiting optimization efficiency and attainable gains. To address these limitations, we propose DR-IRL, which dynamically adjusts rewards through inverse reinforcement learning. We first construct a balanced safety dataset of seven harmful categories using Chain-of-Draft (CoD) template prompts, which reduce token usage and generation time compared to Chain-of-Thought (CoT). We then train category-specific reward models on this dataset via IRL. Finally, to align the LLM, we introduce GRPO-S (Group Relative Policy Optimization--Scaling), a variant of GRPO that scales the reward during optimization to task difficulty -- data-level hardness measured by CLIP similarity and model-level responsiveness measured by reward gaps. Extensive experiments on multiple benchmarks and LLMs demonstrate that DR-IRL outperforms all baselines in safety alignment while maintaining usefulness.

摘要

鲁棒对齐对于安全部署大语言模型(LLMs)至关重要。现有技术可分为两类:基于奖励的方法,即通过偏好数据训练奖励模型并利用强化学习(RL)进行优化;以及无奖励方法,即直接对排序输出进行微调。最新研究表明,经过精心调校的基于奖励流程仍具有最佳鲁棒性,且单响应演示数据可能优于成对偏好数据。然而仍存在两大挑战:(i) 安全数据集不平衡,过度表征常见风险而忽略长尾威胁;(ii) 静态奖励模型忽视任务难度,限制优化效率与收益上限。针对这些局限,我们提出DR-IRL框架,通过逆向强化学习动态调整奖励。首先基于"草案链"(CoD)模板提示构建包含七类危害的平衡安全数据集(相比"思维链"CoT可降低标记消耗与生成时间),继而通过IRL训练类别专属奖励模型。最后为对齐LLM,我们提出GRPO-S(分组相对策略优化-缩放),作为GRPO的改进版本,其根据任务难度(数据级难度通过CLIP相似度衡量,模型级响应性通过奖励差距评估)对奖励进行动态缩放。多基准测试与不同LLM上的实验表明,DR-IRL在保持实用性的同时,其安全对齐性能全面超越现有基线方法。


SuPreME: A Supervised Pre-training Framework for Multimodal ECG Representation Learning

Abstract

arXiv:2502.19668v2 Announce Type: replace-cross Abstract: Cardiovascular diseases are a leading cause of death and disability worldwide. Electrocardiogram (ECG) is critical for diagnosing and monitoring cardiac health, but obtaining large-scale annotated ECG datasets is labor-intensive and time-consuming. Recent ECG Self-Supervised Learning (eSSL) methods mitigate this by learning features without extensive labels but fail to capture fine-grained clinical semantics and require extensive task-specific fine-tuning. To address these challenges, we propose SuPreME, a Supervised Pre-training framework for Multimodal ECG representation learning. SuPreME is pre-trained using structured diagnostic labels derived from ECG report entities through a one-time offline extraction with Large Language Models (LLMs), which help denoise, standardize cardiac concepts, and improve clinical representation learning. By fusing ECG signals with textual cardiac queries instead of fixed labels, SuPreME enables zero-shot classification of unseen conditions without further fine-tuning. We evaluate SuPreME on six downstream datasets covering 106 cardiac conditions, achieving superior zero-shot AUC performance of 77.20%, surpassing state-of-the-art eSSLs by 4.98%. Results demonstrate SuPreME's effectiveness in leveraging structured, clinically relevant knowledge for high-quality ECG representations.

摘要

心血管疾病是全球范围内导致死亡和残疾的主要原因。心电图(ECG)对于诊断和监测心脏健康至关重要,但获取大规模标注的ECG数据集需要耗费大量人力和时间。近期出现的ECG自监督学习(eSSL)方法通过无标签特征学习缓解了这一问题,但未能捕捉细粒度的临床语义且需要大量任务特定微调。为解决这些挑战,我们提出SuPreME,一个基于监督预训练的多模态ECG表征学习框架。该框架利用大型语言模型(LLMs)从ECG报告实体中一次性离线提取结构化诊断标签进行预训练,这些标签有助于去噪、标准化心脏概念并提升临床表征学习。通过将ECG信号与文本化心脏查询(而非固定标签)相融合,SuPreME无需微调即可实现未知病症的零样本分类。我们在涵盖106种心脏疾病的六个下游数据集上评估SuPreME,其零样本AUC性能达到77.20%,较当前最优eSSL方法提升4.98%。结果表明SuPreME能有效利用结构化的临床相关知识来获取高质量的ECG表征。


Learning to Reason Over Time: Timeline Self-Reflection for Improved Temporal Reasoning in Language Models

Abstract

arXiv:2504.05258v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) have emerged as powerful tools for generating coherent text, understanding context, and performing reasoning tasks. However, they struggle with temporal reasoning, which requires processing time-related information such as event sequencing, durations, and inter-temporal relationships. These capabilities are critical for applications including question answering, scheduling, and historical analysis. In this paper, we introduce TISER, a novel framework that enhances the temporal reasoning abilities of LLMs through a multi-stage process that combines timeline construction with iterative self-reflection. Our approach leverages test-time scaling to extend the length of reasoning traces, enabling models to capture complex temporal dependencies more effectively. This strategy not only boosts reasoning accuracy but also improves the traceability of the inference process. Experimental results demonstrate state-of-the-art performance across multiple benchmarks, including out-of-distribution test sets, and reveal that TISER enables smaller open-source models to surpass larger closed-weight models on challenging temporal reasoning tasks.

摘要

大型语言模型(LLMs)已成为生成连贯文本、理解上下文和执行推理任务的强大工具。然而,其在时间推理方面仍存在困难,该任务需要处理事件顺序、持续时间和跨时间关系等时序信息。这些能力对于问答、日程安排和历史分析等应用至关重要。本文提出TISER框架,通过结合时间线构建与迭代自反思的多阶段流程,显著增强LLMs的时间推理能力。我们的方法利用测试时扩展技术延长推理轨迹,使模型能更有效地捕捉复杂的时间依赖关系。该策略不仅提高了推理准确性,还增强了推断过程的可追溯性。实验结果表明,该方法在包括分布外测试集在内的多个基准上实现了最先进性能,并证明TISER能使较小的开源模型在具有挑战性的时间推理任务上超越更大的闭源权重模型。


Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning

Abstract

arXiv:2504.02922v2 Announce Type: replace-cross Abstract: Model diffing is the study of how fine-tuning changes a model's representations and internal algorithms. Many behaviors of interest are introduced during fine-tuning, and model diffing offers a promising lens to interpret such behaviors. Crosscoders are a recent model diffing method that learns a shared dictionary of interpretable concepts represented as latent directions in both the base and fine-tuned models, allowing us to track how concepts shift or emerge during fine-tuning. Notably, prior work has observed concepts with no direction in the base model, and it was hypothesized that these model-specific latents were concepts introduced during fine-tuning. However, we identify two issues which stem from the crosscoder's L1 training loss and can misattribute concepts as unique to the fine-tuned model when they really exist in both models. We develop Latent Scaling to flag these issues by more accurately measuring each latent's presence across models. In experiments comparing Gemma 2 2B base and chat models, we observe that the standard crosscoder suffers heavily from these issues. Building on these insights, we train a crosscoder with BatchTopK loss and show that it substantially mitigates these issues, finding more genuinely chat-specific and highly interpretable concepts. We recommend practitioners adopt similar techniques. Using the BatchTopK crosscoder, we successfully identify a set of chat-specific latents that are both interpretable and causally effective, representing concepts such as "false information" and "personal question", along with multiple refusal-related latents that show nuanced preferences for different refusal triggers. Overall, our work advances best practices for the crosscoder-based methodology for model diffing and demonstrates that it can provide concrete insights into how chat-tuning modifies model behavior.

摘要

模型差异分析旨在研究微调过程如何改变模型的表征与内部算法。许多关键行为特性正是在微调过程中产生的,而模型差异分析为解释这些行为提供了有效视角。Crosscoders作为新兴的差异分析方法,通过构建基础模型与微调模型共享的可解释概念词典(表现为潜在方向),使我们能够追踪概念在微调过程中的演变轨迹。值得注意的是,先前研究发现某些概念在基础模型中缺乏对应方向,学者推测这些模型特定潜在方向是微调引入的新概念。然而,我们发现由于crosscoders的L1训练损失函数存在两个缺陷,可能导致将本存在于两个模型的概念错误归因为微调模型特有。为此,我们提出"潜在缩放"技术,通过更精确测量潜在方向在模型间的存在程度来识别这些问题。在Gemma 2 2B基础模型与聊天模型的对比实验中,标准crosscoder方法明显受这些问题影响。基于这些发现,我们采用BatchTopK损失函数训练crosscoder,证明该方法能有效缓解上述问题,发现更多真实且高度可解释的聊天模型特有概念。我们建议研究者采用类似技术。利用BatchTopK crosscoder,我们成功识别出一组兼具可解释性与因果有效性的聊天模型特有潜在方向,包括"虚假信息"、"个人问题"等概念,以及多个具有精细触发偏好的拒绝相关潜在方向。本研究推进了基于crosscoder的模型差异分析方法的最佳实践,并证明该方法能有效揭示聊天调优对模型行为的修改机制。
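
The abstract names BatchTopK without describing it. As a rough illustration, the sketch below shows the sparsity rule as it is generally described for BatchTopK sparse autoencoders: keep the `k × batch_size` largest latent activations across the whole batch and zero the rest, so individual samples may keep more or fewer than `k` latents. This is only the activation-selection step, not the crosscoder or its loss. The function name `batch_topk` is hypothetical, and the activations are assumed to be non-negative (for example, post-ReLU).

```python
def batch_topk(acts, k):
    """Keep the k * batch_size largest activations across the batch; zero the rest.

    acts is a list of per-sample activation lists (assumed non-negative).
    """
    budget = k * len(acts)
    # Flatten to (value, sample index, latent index) and rank by value.
    flat = [(v, i, j) for i, row in enumerate(acts) for j, v in enumerate(row)]
    flat.sort(key=lambda t: t[0], reverse=True)
    keep = {(i, j) for _, i, j in flat[:budget]}
    return [[v if (i, j) in keep else 0.0 for j, v in enumerate(row)]
            for i, row in enumerate(acts)]


acts = [[0.9, 0.8, 0.7, 0.1],
        [0.2, 0.0, 0.05, 0.3]]
sparse = batch_topk(acts, k=2)
print(sparse)  # [[0.9, 0.8, 0.7, 0.0], [0.0, 0.0, 0.0, 0.3]]
```

With `k = 2` and two samples the budget is four activations; here the first sample keeps three and the second keeps one, whereas a per-sample top-k rule would force exactly two each.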